postcore / limon

:lemon: The pluggable JavaScript lexer. :tada: Limon = Lemon
http://www.tunnckocore.tk
MIT License

add css lexer to examples #7

Open · tunnckoCore opened 8 years ago

tunnckoCore commented 8 years ago

Port the awesome PostCSS tokenizer, using plugins. By the way, this tokenizer may actually help even the CSON folks: they could build the CSON syntax on top of it (I tested a few complex structures), which we will create using limon and plugins.
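To make the idea concrete, here is a minimal sketch of what a character-level plugin system can look like. All names here (`Lexer`, `use`, `tokenize`, the plugin signature) are illustrative assumptions, not limon's actual API:

```javascript
// Hypothetical sketch of a pluggable, character-based lexer.
// Names and signatures are illustrative, not limon's real API.
class Lexer {
  constructor() { this.plugins = []; }
  use(plugin) { this.plugins.push(plugin); return this; }
  tokenize(input) {
    const tokens = [];
    let pos = 0;
    while (pos < input.length) {
      let consumed = 0;
      // Each plugin sees the current character, its position,
      // the whole input string, and the token list so far.
      for (const plugin of this.plugins) {
        consumed = plugin(input[pos], pos, input, tokens);
        if (consumed > 0) break;
      }
      pos += consumed > 0 ? consumed : 1; // skip characters no plugin claims
    }
    return tokens;
  }
}

// Example plugin: collect runs of word characters as "word" tokens.
function wordPlugin(ch, pos, input, tokens) {
  if (!/\w/.test(ch)) return 0;
  let end = pos;
  while (end < input.length && /\w/.test(input[end])) end++;
  tokens.push({ type: 'word', value: input.slice(pos, end), start: pos, end });
  return end - pos;
}

const tokens = new Lexer().use(wordPlugin).tokenize('color: red');
// → two 'word' tokens: 'color' and 'red'
```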

/cc @ai @MoOx @jkrems @balupton @RobLoach @ben-eb @kof

ai commented 8 years ago

Hm. Is it some kind of universal tokenizer? I like the idea of a universal solution.

But it will be hard to migrate PostCSS to it. In my experience, it is better to create a special tokenizer for each specific language.

For example, the comment token differs between languages. In CSS, even context matters: a comment can appear inside a function (color()), but not inside the special “function” url().
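To illustrate that context-sensitivity, here is a toy comment scanner (not PostCSS code) that must treat the raw body of url() as literal text, where `/*` is not a comment opener:

```javascript
// Toy illustration of CSS comment context: inside an unquoted url(),
// "/*" is literal text, so a comment scanner must skip the raw body.
function findComments(css) {
  const comments = [];
  let i = 0;
  while (i < css.length) {
    if (css.startsWith('url(', i)) {
      // Everything up to ")" inside url() is raw content.
      const close = css.indexOf(')', i);
      i = close === -1 ? css.length : close + 1;
    } else if (css.startsWith('/*', i)) {
      const close = css.indexOf('*/', i + 2);
      const end = close === -1 ? css.length : close + 2;
      comments.push(css.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

findComments('a { color: color(/* ok */ red); background: url(/*not-a-comment*/); }');
// → ['/* ok */']
```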

And comments are the most important part of a tokenizer. If we push comment tokenizing into the parser, it removes all the benefits of the tokenizer/parser separation.

ai commented 8 years ago

Ouch, it seems I missed that this tokenizer supports plugins. That makes it twice as interesting! :)

What about performance? The tokenizer is the slowest part of any parser.

tunnckoCore commented 8 years ago

Oh, I forgot to ping the tokenizer/lexer/parser guru @wooorm, creator of a few awesome things such as parse-latin, parse-english, retext and remark, and a few AST specs. I think he may be interested.

@ai it won't be hard. The PostCSS parser works on a per-character basis anyway. Performance should be at least the same (really, it depends on what the plugins do; they don't even have to use regexes, since internally it's all just one loop over the string), but with the ability to decompose things further with plugins.

“The tokenizer is the slowest part of any parser”; “it is better to create a special tokenizer for each specific language.”

Indeed. But this lexer is totally agnostic: it just loops over the string and passes you each character, its position, and the whole input string, all of which are available in each plugin.

You can build any type of parser. That's the idea of this separation. There are three processes: a lexer (with a .tokenize method returning tokens) for generating tokens, a parser (with a .parse method returning an AST), and a stringifier (with a .stringify method) for consuming the AST and composing the new string. It's simple and awesome.
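A condensed sketch of that three-stage shape, using a trivial `key:value;` language. All names here are generic illustrations, not limon's API:

```javascript
// Sketch of the lexer -> parser -> stringifier separation,
// demonstrated on a trivial "key:value;" language.
function tokenize(input) {
  // Split the input into word, ":" and ";" tokens.
  return input.match(/[\w-]+|[:;]/g) || [];
}

function parse(tokens) {
  // Build a tiny AST: a root node holding declaration nodes.
  const ast = { type: 'root', children: [] };
  for (let i = 0; i < tokens.length; i += 4) {
    // tokens[i] = key, tokens[i+1] = ":", tokens[i+2] = value, tokens[i+3] = ";"
    ast.children.push({ type: 'decl', key: tokens[i], value: tokens[i + 2] });
  }
  return ast;
}

function stringify(ast) {
  // Consume the AST and compose the output string.
  return ast.children.map(d => `${d.key}:${d.value};`).join('');
}

stringify(parse(tokenize('color:red;width:10px;')));
// → 'color:red;width:10px;' (round-trips back to the input)
```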

One way or another, your tokenizer must be extracted, and in any case it can then be used for CSS, JSON and CSON, just as retext can be used for whatever you want on top of some parser. And if we think further, any parser (I'm about to push one to GitHub in a bit) can then be extended, again with plugins, as @wooorm does it. He has one parser for English/Latin which generates an AST (a CST, actually); parsers built on top of it extend that AST and add more node types: UNIST AST -> NLCST AST, etc.

With a separate lexer we can produce whatever kind of parser/AST we want.

tunnckoCore commented 8 years ago

Anyway, I need a lexer and a parser, so I'll try to do two things: a JSON lexer, parser and stringifier (I've also started work on https://github.com/postjson/postjson), and a semver lexer, parser and stringifier (I really don't like the semver package: its API is awful and I need more flexibility; if I can achieve at least the same speed and pass its tests, I'll make a PR there). It's also worth porting PostCSS to use limon, specifically for CSS only; then we can merge it into postcss :)

edit: It's going to be a great journey! :)
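For the semver idea, an illustrative (and deliberately tiny) tokenizer/parser pair could look like this; it is a sketch of the shape, not the planned limon-based API:

```javascript
// Illustrative semver tokenizer/parser, not the planned limon-based API.
function tokenizeSemver(version) {
  // "1.2.3-beta.1" -> ['1', '.', '2', '.', '3', '-', 'beta', '.', '1']
  return version.match(/\d+|[A-Za-z]+|[.+-]/g) || [];
}

function parseSemver(version) {
  // Split off an optional prerelease tag, then the three numeric parts.
  const [core, prerelease] = version.split('-');
  const [major, minor, patch] = core.split('.').map(Number);
  return { major, minor, patch, prerelease: prerelease || null };
}

parseSemver('1.2.3-beta.1');
// → { major: 1, minor: 2, patch: 3, prerelease: 'beta.1' }
```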

tunnckoCore commented 8 years ago

See the other examples: simple, CSV and semver. :)

tunnckoCore commented 8 years ago

@ai I'm thinking of first extracting only the tokenizer into a separate repo, to play with it there and run some benchmarks, and then adding the ported tokenizer to the benchmark. That way we can see what the diffs will be. :)

ai commented 8 years ago

I really like the idea of a universal tokenizer, but as its inventor you should convince me that it is possible to make a fast tokenizer ;).

If you write a proof-of-concept tokenizer with the same performance, I will help you finish it and we will add it to PostCSS.

tunnckoCore commented 8 years ago

“If you write a proof-of-concept tokenizer with the same performance”

Yeah, that's exactly what I'm talking about, and that's what I'm going to do.

jkrems commented 8 years ago

For CSON, reusing the CoffeeScript lexer & parser was a pretty important design decision; we want to match CoffeeScript's syntax, which doesn't exist outside of its lexer/parser (it doesn't have an official, spec'd grammar). If we migrated away, I think we'd use a full parser generator like PegJS.