tunnckoCore opened 8 years ago
Hm. So it is some kind of universal tokenizer? I like the idea of a universal solution.
But it will be hard to migrate PostCSS to it. In my experience, it is better to create a dedicated tokenizer for each specific language.
For example, a comment token will be different in different languages. In CSS we even have different contexts: a comment can appear inside a function (`color()`), but not inside the special “function” `url()`.
And comments are the most important part of a tokenizer. If we put comment tokenizing into the parser, it will remove all the benefits of the tokenizer/parser separation.
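To illustrate the context problem, here is a minimal hand-rolled sketch (not PostCSS's actual tokenizer) that collects comment tokens but treats the body of `url(...)` as opaque, where `/*` has no special meaning:

```javascript
// Minimal sketch: find CSS comment tokens, skipping url(...) bodies
// where /* is just text. Illustrative only, not PostCSS's real code.
function findComments(css) {
  const comments = [];
  let i = 0;
  while (i < css.length) {
    if (css.startsWith('url(', i)) {
      // Inside url(...) nothing is a comment; skip to the closing paren.
      const close = css.indexOf(')', i);
      i = close === -1 ? css.length : close + 1;
    } else if (css.startsWith('/*', i)) {
      const end = css.indexOf('*/', i + 2);
      const stop = end === -1 ? css.length : end + 2;
      comments.push(css.slice(i, stop));
      i = stop;
    } else {
      i++;
    }
  }
  return comments;
}
```

Given `'a { color: color(/* alpha */ red); background: url(/*no*/) }'`, only `/* alpha */` comes back as a comment; the same characters inside `url()` are left alone.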
Oh, it seems I missed that this tokenizer supports plugins. That makes it twice as interesting! :)
What about performance? The tokenizer is the slowest part of any parser.
Oh, I forgot to ping the tokenizer/lexer/parser guru @wooorm, creator of a few awesome things such as `parse-latin`, `parse-english`, `retext` and `remark`, and a few AST specs. I think he may be interested.
@ai it won't be hard. In any case, the PostCSS parser also works on a per-character basis. Performance should be at least the same (really, it depends on what the plugins do; they don't even need to use regexes, since internally it is all just one loop over the string), but with the ability to decompose things further with plugins.
> The tokenizer is the slowest part of any parser. […] it is better to create a dedicated tokenizer for specific languages.
Indeed. But this lexer is totally agnostic: it just loops over the string and passes you each character, its position and the whole input string, all of which are available in each plugin.
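A rough sketch of what such a plugin-driven loop could look like (hypothetical API, not limon's actual one): the lexer walks the input one character at a time and hands every plugin the character, its position and the full input; a plugin that recognizes a token returns it with an `end` index so the loop can jump past it.

```javascript
// Hypothetical plugin-based lexer sketch (not limon's real API).
// Each plugin gets (char, pos, input) and may return a token with an
// `end` index; the lexer then jumps past the consumed span.
function lex(input, plugins) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const plugin of plugins) {
      const token = plugin(input[pos], pos, input);
      if (token) {
        tokens.push(token);
        pos = token.end; // the plugin tells us where the token stops
        matched = true;
        break;
      }
    }
    if (!matched) pos++; // no plugin claimed this character
  }
  return tokens;
}

// Example plugin: consume runs of digits as a single `number` token.
function numberPlugin(ch, pos, input) {
  if (ch < '0' || ch > '9') return null;
  let end = pos;
  while (end < input.length && input[end] >= '0' && input[end] <= '9') end++;
  return { type: 'number', value: input.slice(pos, end), start: pos, end };
}
```

So `lex('ab12cd345', [numberPlugin])` yields two `number` tokens, `'12'` and `'345'`, and the same loop stays completely language-agnostic: only the plugins know anything about the grammar.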
You can build any type of parser. That's the idea of this separation. There are three processes: a lexer (with a `.tokenize` method returning tokens) for generating tokens, a parser (with a `.parse` method returning an AST), and a stringifier (with a `.stringify` method) for consuming the AST and composing the new string.
It's simple and awesome.
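The three-stage split described above can be sketched as a round trip over a toy comma-separated grammar (the method names follow the comment; the implementations are purely illustrative):

```javascript
// Sketch of the lexer -> parser -> stringifier split, using a toy
// comma-separated grammar. Illustrative only.
const lexer = {
  tokenize(input) {
    // Emit `value` and `comma` tokens.
    return input.split(/(,)/).map((raw) =>
      raw === ',' ? { type: 'comma' } : { type: 'value', value: raw }
    );
  },
};

const parser = {
  parse(tokens) {
    // Build a tiny AST: a root node whose children are the values.
    return {
      type: 'root',
      children: tokens
        .filter((t) => t.type === 'value')
        .map((t) => ({ type: 'item', value: t.value })),
    };
  },
};

const stringifier = {
  stringify(ast) {
    // Compose the string back from the AST.
    return ast.children.map((n) => n.value).join(',');
  },
};

const ast = parser.parse(lexer.tokenize('a,b,c'));
// stringifier.stringify(ast) round-trips back to 'a,b,c'
```

Each stage can be swapped independently: the same lexer can feed differently shaped parsers, and the same AST can be stringified in different styles.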
One way or another, your tokenizer must be extracted, and in any case it can then be used for CSS, JSON and CSON, just as `retext` can be used for whatever you want on top of some parser. And if we think further, any parser (I'm about to push it to GitHub in a bit) can then be extended (again with plugins).
That's how @wooorm does it. He has one parser for English/Latin which generates an AST (a CST, actually); parsers built on top of it then extend that AST with more node types: unist AST -> nlcst AST, and so on.
With a separate lexer we can produce whatever kind of parser/AST we want.
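For example (a hand-written illustration, not output of any of those parsers), a generic unist node is just a `type` plus `children` or `value`, and nlcst layers natural-language node types on top of that same shape:

```javascript
// Hand-written illustration of how nlcst specializes the generic
// unist shape. A unist parent node is just { type, children }:
const unistNode = {
  type: 'root',
  children: [],
};

// nlcst reuses that shape but adds natural-language node types
// (RootNode -> ParagraphNode -> SentenceNode -> WordNode -> TextNode),
// so tools written against unist keep working on nlcst trees:
const nlcstSentence = {
  type: 'SentenceNode',
  children: [
    { type: 'WordNode', children: [{ type: 'TextNode', value: 'Hello' }] },
    { type: 'PunctuationNode', value: '!' },
  ],
};
```

Because every layer keeps the same `{ type, children | value }` shape, generic tree utilities (visitors, stringifiers) work unchanged across all of them.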
In any case, I need a lexer and a parser. I'll try to do two things: a JSON lexer, parser and stringifier (I also started work on https://github.com/postjson/postjson), and a semver lexer, parser and stringifier (I really don't like the `semver` package; its API is awful and I need more flexibility. If I can achieve at least the same speed and pass the tests, I'll make a PR there). It would also be worth porting PostCSS to use `limon` specifically for CSS only; then we can merge it into PostCSS :)
edit: It's going to be a great journey! :)
See the other examples - simple, CSV and semver. :)
@ai I'm thinking of first extracting only the tokenizer into a separate repo to play with it there and set up some benchmarks, then adding the ported tokenizer to the benchmark. That way we can see what the diffs will be. :)
I really like the idea of a universal tokenizer, but as its inventor you should convince me that it is possible to make a fast one ;).
If you write a proof-of-concept tokenizer with the same performance, I will help you finish it and we will add it to PostCSS.
> If you write a proof-of-concept tokenizer with the same performance

Yeah, that's exactly what I'm talking about and what I'm going to do.
For CSON, reusing the CoffeeScript lexer & parser was a pretty important design decision: we want to match CoffeeScript's syntax, which doesn't exist outside of its lexer/parser (it has no official, spec'd grammar). If we were to migrate away, I think we'd use a full parser generator like PegJS.
Port the awesome PostCSS tokenizer, using plugins. Btw, this tokenizer may actually even help the CSON guys, meaning they could build the CSON syntax on top of this tokenizer (I tested a few complex structures), which we will create using `limon` and plugins.

/cc @ai @MoOx @jkrems @balupton @RobLoach @ben-eb @kof