[parser] Implement full featured CSS parser

alexander-akait commented 1 year ago

Just an idea for the future and for further discussion: maybe we can join forces and write a full-featured CSS parser plus at-rule/value parsers from scratch. I am afraid we can't rewrite postcss due to some specific logic (and it would probably take longer), so uniting around a CSS parser would be great for any JS tooling. We can open an issue for this.

Briefly, the situation:

We have postcss, and it has certain problems that, unfortunately, have not been resolved for a long time: a CSS-compliant tokenizer and parser, and selector, at-rule and value parsers
There is the csstree parser, but it is pretty slow to resolve issues
There are Rust-based parsers like lightningcss and swc, but unfortunately they are not extensible enough to support all syntaxes. This is probably solvable, so it's just a discussion for now
There is csstools with its own value and at-rule parsers
We have postcss-value-parser, postcss-values-parser and postcss-selector-parser. All of them have rather serious limitations and are not so actively maintained. They solve almost all current problems, but when a new syntax appears it is usually a problem. Another big problem is the postcss design: we need to reparse selectors, values and at-rules in each rule, which is very bad for performance (very)
We have to decide whether we will use JS-based or Rust-based tooling (I support any decision, the main thing is to find a point of view that suits everyone)
Combining efforts will allow us to solve problems more quickly and avoid duplication of work
We need to think about the AST and its structure: we can align with CSSOM, or we can design our own AST (yes, it will most likely be identical to CSSOM but with some additional nodes/properties) to be able to do deep analysis and fixes
We need to think about syntaxes that are not standard CSS (like sass/less/etc.) and how to make the design extensible, but, yeah, we can start with only CSS and improve it later. CSS is error-tolerant by default, so everything that we can't parse will be a ListOfComponentValues
Any other thoughts?

Feel free to give feedback

I decided to raise the problem here, as I think this is the most appropriate place; in the future we may move it or break it into more detailed parts.

ybiquitous commented 1 year ago

@alexander-akait Thanks for raising this parser discussion! This is an exciting topic.

So, do you think this issue covers stylelint/stylelint#5586?

I also think @csstools/* tools like @csstools/media-query-list-parser or @csstools/css-tokenizer by @romainmenke are great achievements in this area.

ybiquitous commented 1 year ago

We have postcss, and it has certain problems that, unfortunately, have not been resolved for a long time: a CSS-compliant tokenizer and parser, and selector, at-rule and value parsers

Does PostCSS's author recognize these problems? And can we not add parser improvements upstream to PostCSS?

Because Stylelint is currently a part of the PostCSS's ecosystem, I think PostCSS would be best for backward compatibility if PostCSS accepted requests for parser improvements.

romainmenke commented 1 year ago

Thank you for bringing this up and for gathering all this info 🙇

I think the performance of PostCSS and the CSS tooling ecosystem built around PostCSS is a complicated subject :)

On the one hand PostCSS itself is really really fast. But the way it is designed means that there will be a lot of duplicate work on selectors, values, at-rule preludes, ...

It also is written in JS, so it's bound by the constraints of JS engines. There are a lot of factors here, but in short, it will never be Rust.

Imho it isn't realistically possible to create a new parser that solves everything :

equally fast or faster than PostCSS
can support non-standard syntaxes (less, scss, css-in-js, ...)
can support an ecosystem of plugins
will still be fast when a lot of plugins are run
can parse CSS in its entirety
has a user friendly API surface
has a complete and correct Object Model
is written in JavaScript

If the constraint that is most important is performance, then it makes more sense to me that Rust or something similar is chosen as a starting point and that other aspects are sacrificed.

However, what I consider to be the most valuable part of PostCSS is not performance but the community and existing adoption.

There is a very large chance that people already have PostCSS as part of their stack, so the barrier to adding a tool based on PostCSS is low.

There is also very little friction within the active PostCSS community, a lot of people are open to collaborate towards a common goal. ( like this here :) )

Starting a community from scratch around a new toolset is not something I am personally interested in :) Not that what I am or am not interested in should stop anyone

Does PostCSS's author recognize these problems?

Yes : https://github.com/postcss/postcss/issues/1145

It's a known issue that it is a waste that each plugin needs to parse values, selectors, at-rule preludes over and over again.

And can we not add parser improvements upstream to PostCSS?

I tried and "succeeded" : https://github.com/postcss/postcss/pull/1812

I wrote more about why I advised not to merge that : https://github.com/postcss/postcss/issues/1145#issuecomment-1397999466

TL;DR; the cost of rolling out that change was too high. It would have fractured the PostCSS ecosystem in a way that it might not recover from.

But it would have made it possible for multiple consumers to parse from an existing token array instead of starting from a string. For a tool that is mostly read heavy like Stylelint it would have meant a serious performance gain.

PostCSS as a host/driver for plugins just works really well and the reason that it is successful is also the reason why we have a performance issue. By hiding a lot of complexity and only exposing a limited Object Model it is much easier to create a simple plugin. But it becomes harder to have a performant "tool chain".

My current approach in postcss-preset-env for better performance is to try and guard each parsing of values, selectors, ... with a small test.

If I want to create a fallback for the ic unit, first check if ic exists as a substring of the value. It might be part of a word like pick and that is fine, but most of the values would be skipped without further parsing.
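A minimal sketch of such a guard. The function names and the naive stand-in parser are hypothetical and are not taken from postcss-preset-env; the only point is that a cheap substring test runs before any parsing.

```js
// Hypothetical stand-in for a real value parser. It counts calls so the
// effect of the guard is visible.
let parseCalls = 0;
function expensiveParse(value) {
	parseCalls += 1;
	// Deliberately naive: split on whitespace and keep words shaped like "<number>ic".
	return value.split(/\s+/).filter((word) => /^-?[\d.]+ic$/i.test(word));
}

// Cheap pre-check: a plain substring test. False positives such as "pick"
// are fine, they only cost one parse that finds nothing.
function mayContainIcUnit(value) {
	return value.toLowerCase().includes('ic');
}

function findIcDimensions(value) {
	if (!mayContainIcUnit(value)) return []; // fast abort, no parsing at all
	return expensiveParse(value);
}

const values = ['10px', '2ic', 'pick 1em', '1px 4IC'];
const results = values.map(findIcDimensions);

console.log(results, `parsed ${parseCalls} of ${values.length} values`);
```

Only values that contain the substring reach the parser, so in a typical stylesheet most declarations would be skipped.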

We might be able to do similar things in Stylelint?
Best to discuss that in its own issue.

Something that I haven't tried yet, but that I think could work is to cache parsed values.

a shared cache between plugins, packages,...
source string is the cache key
cache value contains
- tokens
- optionally more specialized results (component values, media queries, ...)

Each time you take something out of the cache, the entry is removed. If you didn't mutate it, or if you produced more useful parsed values, you add it back to the cache.

This would be extremely sensitive to bugs and any bug would be hard to fix. But it would allow read heavy tasks to share work.
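A rough sketch of that idea, assuming a single in-process cache object. `ParseCache`, `readTokens` and the toy tokenizer are invented for illustration; no such shared cache exists in PostCSS today.

```js
// Hypothetical shared cache: the source string is the key, and the entry
// holds tokens plus optional, more specialized results.
class ParseCache {
	#entries = new Map();

	// Taking an entry removes it, so the caller owns it exclusively and
	// cannot corrupt what another plugin sees by mutating it.
	take(source) {
		const entry = this.#entries.get(source);
		this.#entries.delete(source);
		return entry;
	}

	// Callers put an entry back only if it still matches the source string.
	put(source, entry) {
		this.#entries.set(source, entry);
	}
}

// Toy tokenizer used only for this example.
const toyTokenize = (source) => source.split(/\s+/).filter(Boolean);

let tokenizeCalls = 0;
function readTokens(cache, source) {
	let entry = cache.take(source);
	if (!entry) {
		tokenizeCalls += 1;
		entry = { tokens: toyTokenize(source) };
	}
	const count = entry.tokens.length; // read-only use
	cache.put(source, entry); // not mutated, so it goes back
	return count;
}

const cache = new ParseCache();
const first = readTokens(cache, '1px solid red'); // tokenizes
const second = readTokens(cache, '1px solid red'); // reuses the cached tokens
```

A consumer that mutates the tokens would simply not call `put`, which is where the sensitivity to bugs comes from: one forgotten rule breaks every later reader.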

Also best to discuss this in its own issue.

My current goal with packages like @csstools/css-tokenizer is to lower the barrier to creating high quality parsers for CSS. I want people to have great tooling, even for modern syntax. I don't want there to be a gap of years between a feature landing in browsers and tooling to catch up.

Because it is unopinionated, follows the CSS specification, and doesn't support non-standard syntax it is also really stable. Either it implements the specification correctly or it has a bug and a bug can always be fixed in a patch release. (we might still do semver major from time to time, but these should be rare)

Many things can also be done at the tokenizer level:

upper vs. lower casing of idents/functions
unknown units
normalizing whitespace
removing comments
...
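As an illustration of token-level work, here is a small sketch that removes comments, collapses whitespace and lower-cases identifiers without building any tree. The `[type, representation]` pairs and the type names are assumptions made for this example, not the exact @csstools/css-tokenizer output, and lower-casing every identifier is only safe where the identifier is case-insensitive.

```js
// Tokens as [type, representation] pairs (assumed shape for this example).
const input = [
	['ident', 'COLOR'],
	['colon', ':'],
	['whitespace', '   '],
	['comment', '/* brand */'],
	['whitespace', ' '],
	['ident', 'Red'],
];

function normalize(tokenList) {
	const out = [];
	for (const [type, repr] of tokenList) {
		if (type === 'comment') continue; // remove comments
		if (type === 'whitespace') {
			// Collapse runs of whitespace, also across removed comments.
			if (out.at(-1)?.[0] !== 'whitespace') out.push([type, ' ']);
			continue;
		}
		// Lower-case identifiers; other tokens pass through unchanged.
		out.push([type, type === 'ident' ? repr.toLowerCase() : repr]);
	}
	return out;
}

const normalized = normalize(input)
	.map(([, repr]) => repr)
	.join('');

console.log(normalized); // color: red
```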

On top of the tokenizer there are the parser algorithms. They are currently limited and only implement the basics for component values. Ideally we extend these to cover more of the CSS syntax.

These allow you to do more, because structures like blocks, functions are fully parsed. But there isn't any Object Model specific to your context.

To actually have a useful Object Model another layer is needed: specialized parsers which are only invoked when relevant. Things like the media query list parser.

This has a complete Object Model but that is also what makes it massive. There are so many node types in this sub-syntax alone.

We need to think about syntaxes that are not standard CSS (like sass/less/etc.) and how to make the design extensible, but, yeah, we can start with only CSS and improve it later. CSS is error-tolerant by default, so everything that we can't parse will be a ListOfComponentValues

I don't personally use non-standard CSS syntax, everything is plain CSS in a file that has a .css extension. (I am not a frontend developer, so more accurate to say that the team I work in writes plain CSS)

My main reason not to support these is because they do not have a true standards body behind them and that I lack familiarity with these syntaxes.

Correctly following one specification is difficult enough. Also adding support for several syntaxes that do not even have a specification is not something I want to spend my time on.

But having said that, all tools I've created are composable and modular. You can rewrite the token stream to make scss look like css, or have different parsing algorithms and then pass on the result of that to one of the specialized parsers like for media queries.

I want people to be able to re-use the complex and hard parts.

Some questions :

are there specific parts of CSS that lack detailed parsers and that this lack of a parser is blocking specific features?
has anyone done any research on what is fast enough for specific tools (minifiers, bundlers, linters, ...) [^1]

[^1]: Faster is always better, but at some point people don't notice the gains anymore.

ybiquitous commented 1 year ago

@romainmenke Thanks for sharing postcss/postcss#1145. Now I understand the context very well. 👍🏼

ybiquitous commented 1 year ago

@romainmenke I'll try answering your questions as far as I know:

are there specific parts of CSS that lack detailed parsers and that this lack of a parser is blocking specific features?

I don't remember completely, but this project may have some blockers due to insufficient parser libraries.

has anyone done any research on what is fast enough for specific tools (minifiers, bundlers, linters, ...)

Unfortunately, I don't know.

ybiquitous commented 1 year ago

I've tried listing the parser libraries used by Stylelint. Some are barely maintained 😓

| Name | Version | Last published | Unpacked size |
|:-----|:--------|:---------------|---------------:|
| `postcss` | 8.4.23 | Apr 20, 2023 | 194KB |
| `postcss-media-query-parser` | 0.2.3 | Oct 27, 2016 | n/a |
| `postcss-resolve-nested-selector` | 0.1.1 | Feb 19, 2016 | n/a |
| `postcss-safe-parser` | 6.0.0 | Jun 14, 2021 | 5.2KB |
| `postcss-selector-parser` | 6.0.13 | May 16, 2023 | 186KB |
| `postcss-value-parser` | 4.2.0 | Nov 29, 2021 | 27KB |
| `@csstools/css-parser-algorithms` | 2.1.1 | Apr 10, 2023 | 31KB |
| `@csstools/css-tokenizer` | 2.1.1 | Apr 10, 2023 | 59KB |
| `@csstools/media-query-list-parser` | 2.0.4 | Apr 10, 2023 | 122KB |
| `@csstools/selector-specificity` | 2.2.0 | Mar 21, 2023 | 17KB |
| `css-tree` | 2.3.1 | Dec 15, 2022 | 1.2MB |

Script used to create the table

```js
import { spawnSync } from 'child_process';

const allDeps = JSON.parse(
	spawnSync('npm', ['view', '--json', 'stylelint@15.6.2', 'dependencies']).stdout.toString(),
);

const parserDeps = [
	'postcss',
	'postcss-media-query-parser',
	'postcss-resolve-nested-selector',
	'postcss-safe-parser',
	'postcss-selector-parser',
	'postcss-value-parser',
	'@csstools/css-parser-algorithms',
	'@csstools/css-tokenizer',
	'@csstools/media-query-list-parser',
	'@csstools/selector-specificity',
	'css-tree',
];

const dateFormat = new Intl.DateTimeFormat('en', { dateStyle: 'medium' });
const sizeFormat = new Intl.NumberFormat('en', { notation: 'compact' });

console.log(`| Name | Version | Last published | Unpacked size |`);
console.log(`|:-----|:--------|:---------------|---------------:|`);

for (const name of parserDeps) {
	const version = allDeps[name];

	if (!version) {
		throw new Error(`${name} is not in dependencies`);
	}

	let dep = JSON.parse(
		spawnSync('npm', ['view', '--json', `${name}@${version}`]).stdout.toString(),
	);

	if (Array.isArray(dep)) {
		dep = dep.at(-1);
	}

	const lastPublished = dateFormat.format(new Date(dep.time[dep.version]));
	const size = dep.dist.unpackedSize ? sizeFormat.format(dep.dist.unpackedSize) + 'B' : 'n/a';

	console.log(
		`| [\`${name}\`](https://www.npmjs.com/package/${name}) | ${dep.version} | ${lastPublished} | ${size} |`,
	);
}
```

EDIT: This list is as of Stylelint 15.6.2

ybiquitous commented 1 year ago

Problems with dependent parsers:

we need to replace unmaintained parsers with alternatives
we need to notify plugin authors when parsers become unmaintained or are replaced
- https://github.com/stylelint/stylelint/blob/6c85850f1135085f76948beede43cb6933a2cd60/docs/developer-guide/rules.md?plain=1#L96-L101
css-tree is large and may become unmaintained

romainmenke commented 1 year ago

Of that list only these seem immediately problematic to me :

postcss-resolve-nested-selector
postcss-media-query-parser

They have not been updated even though the CSS specifications relevant to them changed years ago.

postcss-value-parser has a few open issues which are hard to fix but these are edge cases, not entire unsupported features. Maybe this one can be handled more on a case by case basis?

For css-tree, it is hard for me to judge the situation. It might be a temporary gap in between active maintenance? Would be good to reach out.

I really like the syntax checking it offers and it's not trivial to re-create this feature.

alexander-akait commented 1 year ago

Oh, there are a lot of messages

Imho it isn't realistically possible to create a new parser that solves everything :

equally fast or faster than PostCSS
can support non-standard syntaxes (less, scss, css-in-js, ...)
can support an ecosystem of plugins
will still be fast when a lot of plugins are run
can parse CSS in its entirety
has a user friendly API surface
has a complete and correct Object Model
is written in JavaScript

I fully disagree:

The PostCSS tokenizer is not CSS-compliant. We already have problems with it and, trust me, there will be more and more of them in the future, and once again I will have to create more hacks. If necessary, I can list and point to everything
Due to the lack of good tokenizer support, we don't have a ListOfComponentValues for declarations, at-rules, selectors, etc.
Because of that, we need to reparse them in each plugin (very, very, very bad for performance)
Because of that, we also don't have a good CSS AST. For example, look at the Comment node: working with comments in postcss is hell, and we have around 9k hacks to make it work (and to get their content). Compare the babel/acorn comment implementations: no comments in the AST, and you can easily tell trailing and remaining comments apart
It can be extensible, even at a lower level: look at acorn, for example, and how it is implemented, no need
PostCSS has not made any improvements for a long time; it just froze at some stage of development
No grammar parsing, so stylelint has 2 CSS parsers, one for grammar and the other for syntax analysis (csstree and postcss)
No ability to associate structures/multiple parsing results/grammar parsing/information with AST nodes; you need to stringify everything and return it

By default the CSS tokenizer (and the CSS parser too) is error-tolerant, so we don't need to worry much about non-standard CSS, because per the spec it will be a ListOfComponentValues if we can't apply a grammar.

If the constraint that is most important is performance, then it makes more sense to me that Rust or something similar is chosen as a starting point and that other aspects are sacrificed.

However, what I consider to be the most valuable part of PostCSS is not performance but the community and existing adoption.

There is a very large chance that people already have PostCSS as part of their stack, so the barrier to adding a tool based on PostCSS is low.

There is also very little friction within the active PostCSS community, a lot of people are open to collaborate towards a common goal. ( like this here :) )

Starting a community from scratch around a new toolset is not something I am personally interested in :) Not that what I am or am not interested in should stop anyone

I propose not to appeal to emotion but to return to reality: if the tool is not going to solve problems and does not provide an opportunity to solve them, then it's time to change the tool.

Some questions :

are there specific parts of CSS that lack detailed parsers and that this lack of a parser is blocking specific features?
has anyone done any research on what is fast enough for specific tools (minifiers, bundlers, linters, ...) [^1]

Yes and yes, but we simply have incredible performance issues and bugs

Now let's get back to being more constructive:

By performance, I don't mean that we should have speed like C++ or Rust; it should just be acceptable. Here is a clear example of the problem: cssnano has around 14-16 plugins under the hood, and in almost every one we parse selectors and values. It's the same here in stylelint. If this is not a clear performance problem, then I immediately give up
All parsers are divided into small packages, like postcss-value-parser, postcss-selector-parser, postcss-media-parser, postcss--again-again-again-parser. Some of them are abandoned, some are simply not maintained, they are not coordinated with each other, they have different ASTs, and they fail to resolve issues promptly
We do not have a proper grammar parser, which is why the code is full of complex loops and even more complex conditions. Here is a part of my code, and not a small one; reading and maintaining it is a crazy effort. A grammar parser would make it possible to simplify this by at least half
I already mentioned the problems with comments: if you need to get a comment like /* i-need-to-ignore-the-next-line */ (it can be in any place), you need to do magic things
I've been talking about this and pointing out these problems for a long time, but there is no movement
It is worth adding that I have already implemented a tokenizer and parser in Rust (https://github.com/swc-project/swc/blob/main/crates/swc_css_parser/src/lexer/mod.rs), but I am here because I need a JS solution, and I would like to consolidate our efforts here for the stylelint team, @csstools and other teams
I fully understand that this is a big task and that we don't need to take and replace everything right now; I would add that this is not even possible
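To make the point about comments concrete, here is a small sketch of the alternative: comments are kept out of the tree in a flat list with offsets, and a directive is found by position. The data shapes and the `hasDirectiveBefore` helper are assumptions for illustration only; they are not how PostCSS, babel or acorn expose comments.

```js
const source = 'a { /* ignore-next-line */ color: red; margin: 0 }';

// Assumption: the parser returns comments as a separate list with offsets.
// Offsets are derived with indexOf here; a real parser would record them.
const commentStart = source.indexOf('/*');
const commentEnd = source.indexOf('*/') + 2;
const comments = [
	{
		text: source.slice(commentStart + 2, commentEnd - 2).trim(),
		start: commentStart,
		end: commentEnd,
	},
];

const declarations = ['color: red', 'margin: 0'].map((text) => ({
	text,
	start: source.indexOf(text),
	end: source.indexOf(text) + text.length,
}));

// A node is covered by a directive when a matching comment ends before the
// node starts and only whitespace sits between them.
function hasDirectiveBefore(node, directive) {
	return comments.some(
		(comment) =>
			comment.text === directive &&
			comment.end <= node.start &&
			source.slice(comment.end, node.start).trim() === '',
	);
}

const ignored = declarations
	.filter((node) => hasDirectiveBefore(node, 'ignore-next-line'))
	.map((node) => node.text);

console.log(ignored);
```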

That is why I suggest the following steps:

Start working with the CSS tokenizer: implement it and test it
Unite around the value/at-rule and selector parsers and reuse this tokenizer there
Agree on a basic AST node (where and how we store positions, where and how we store comments, etc.)
Improve PostCSS to allow storing structures, so multiple plugins can reuse the results of parsing them
Implement a general CSS parser (as in the spec) and release postcss-new-parser (maybe with a better name), where we will generate the PostCSS AST but using our tokenizer and parser
Deprecate postcss-value-parser (it is already just a tokenizer-level parser, so we won't need it anymore), and turn postcss-selector-parser and postcss-at-rules-parser (there are multiple parsers) into utils for our parser
Focus on grammar parsing and simplify our parsers from above because we have it
Focus on extensibility: we can allow overriding the tokenizer steps and the parser (they are all in the spec, we don't need to invent something new), so we can implement basic support for SCSS/Less/etc. I think we have enough basic structures (like Rule/AtRule/Declaration/etc.)
At this point we already have our own full-featured parser with grammar parsing. If we want to, we can start to replace PostCSS and implement transformers with plugin support (and PostCSS AST support), so postcss plugins will still work
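The step about letting PostCSS store structures could look roughly like the sketch below: parse results are memoized per node in a `WeakMap`, so a second plugin reuses what the first one parsed. `getParsedValue`, the toy parser and the plain object standing in for a declaration node are all hypothetical; this is not an existing PostCSS API.

```js
// Hypothetical helper: memoize parse results per AST node so that several
// plugins can share them.
const parsedValues = new WeakMap();

function getParsedValue(node, parse) {
	const cached = parsedValues.get(node);
	// Reuse the result only while the node's raw value is unchanged.
	if (cached && cached.source === node.value) return cached.result;

	const result = parse(node.value);
	parsedValues.set(node, { source: node.value, result });
	return result;
}

// Toy parser and a plain object standing in for a declaration node.
let parseCount = 0;
const toyParse = (value) => {
	parseCount += 1;
	return value.split(/\s+/);
};
const decl = { prop: 'margin', value: '0 auto' };

const pluginA = getParsedValue(decl, toyParse); // parses
const pluginB = getParsedValue(decl, toyParse); // reuses pluginA's result
decl.value = '0';
const pluginC = getParsedValue(decl, toyParse); // value changed, parses again

console.log(parseCount); // 2
```

Keying on the node with a `WeakMap` means entries disappear together with the node, and comparing the stored source string guards against stale results after a plugin rewrites the value.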

Some steps can be split into several; I am fine with that. I would also like to add: I've spent quite a bit of time on a lot of tools and parsers in the postcss ecosystem, and I'm honestly tired, and perhaps this is my last attempt to somehow consolidate all this. If it fails again, I will be very upset again. Ultimately, this will lead to us simply losing most of our community in the near future

ybiquitous commented 1 year ago

@alexander-akait What a big challenge! 👍🏼 👍🏼 👍🏼

I totally agree with a JS solution rather than Rust, since there is a big JS/CSS community here.

Additionally, I agree with starting with a CSS tokenizer and value/at-rule/selector/etc parsers. We will be able to try them in the Stylelint codebase easily.

romainmenke commented 1 year ago

By performance, I don't mean that we should have speed like C++ or Rust; it should just be acceptable. Here is a clear example of the problem: cssnano has around 14-16 plugins under the hood, and in almost every one we parse selectors and values. It's the same here in stylelint. If this is not a clear performance problem, then I immediately give up

Yeah, the performance issue is absolutely clear, I know it very well :) But my point was more that I don't think users of PostCSS see/experience this problem.

LightningCSS for example is (on the surface) a combo of :

postcss-import
postcss-preset-env
cssnano

Even when being so much faster, people aren't really that interested, they think it is very cool, but very few are switching to it. The cost of switching tools is higher than the cost of waiting a few 100ms, even if 90% of that time is useless re-parsing.

I've spent quite a bit of time on a lot of tools and parsers in the postcss ecosystem, and I'm honestly tired, and perhaps this is my last attempt to somehow consolidate all this. If it fails again, I will be very upset again

I can understand this, and I feel this too, but this is also exactly why I am hesitant.

How can we do a project like this sustainably?

funding
sufficient maintainers
ease of adoption
...

The tokenizer is not something we have to start all over right? Is there a reason we can not use our existing tokenizer?

https://github.com/csstools/postcss-plugins/tree/main/packages/css-tokenizer#readme

ybiquitous commented 1 year ago

How can we do a project like this sustainably?

Yes, this is really a headache for us. 😓 But at least, I believe we can provide a place where the Stylelint community members can easily join.

Is there a reason we can not use our existing tokenizer?

Personally, I think @csstools/css-tokenizer is a great starting point.

silverwind commented 1 year ago

Would https://github.com/servo/rust-cssparser be suitable to integrate? It's the CSS parser that Firefox uses. Though its docs do indicate it does not parse into selectors or properties, so it's probably only half a parser.

silverwind commented 1 year ago

We need to think about syntaxes that are not standard CSS (like sass/less/etc.) and how to make the design extensible, but, yeah, we can start with only CSS and improve it later. CSS is error-tolerant by default, so everything that we can't parse will be a ListOfComponentValues

CSS preprocessors are on their way out with CSS now having variables, nesting and color modification. I see no compelling reason anymore to use them.

ybiquitous commented 1 year ago

Would servo/rust-cssparser be suitable to integrate?

It's interesting. But I believe it may be hard for our community to maintain Rust code.

CSS preprocessors are on their way out with CSS now having variables, nesting and color modification. I see no compelling reason anymore to use them.

I think it's important to keep backward compatibility and extendability for CSS-like syntaxes (Sass/Less etc.) because there are big communities already. At least, we should allow anyone to extend and customize our new parser for such syntaxes.

silverwind commented 1 year ago

I think it's important to keep backward compatibility and extendability for CSS-like syntaxes (Sass/Less etc.) because there are big communities already. At least, we should allow anyone to extend and customize our new parser for such syntaxes.

One way of supporting preprocessors would be to transpile the Sass/Less code with source maps to CSS, lint the CSS, and then report back the errors with the position obtained through the source map. Maybe this is already how it works with the existing customSyntax option, not sure.
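A toy sketch of the reporting half of that idea. It assumes the source map has already been decoded into plain mapping objects (real source maps store VLQ-encoded segments and are normally read with a library), and the lookup is simplified to "last mapping at or before the position". The rule name and positions are made up.

```js
// Assumption: mappings are already decoded into plain objects.
const mappings = [
	{ generated: { line: 1, column: 0 }, original: { line: 3, column: 2 } },
	{ generated: { line: 2, column: 2 }, original: { line: 4, column: 4 } },
	{ generated: { line: 4, column: 0 }, original: { line: 9, column: 0 } },
];

// Find the last mapping at or before the generated position.
function originalPositionFor({ line, column }) {
	let best = null;
	for (const mapping of mappings) {
		const g = mapping.generated;
		if (g.line < line || (g.line === line && g.column <= column)) best = mapping;
	}
	return best ? best.original : null;
}

// A warning reported against the compiled CSS ...
const warning = { rule: 'color-named', line: 2, column: 9 };
// ... is re-reported against the Sass/Less source.
const mapped = { ...warning, ...originalPositionFor(warning) };

console.log(mapped);
```

This only covers reporting. Writing fixes back through the same mapping is a different problem, since a position in the compiled CSS does not say how the preprocessor source has to change.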

ota-meshi commented 1 year ago

I personally don't think it's a good idea to rely on using source maps. I think autocorrection breaks syntax in most cases.

silverwind commented 1 year ago

Right, --fix would not work via such a sourcemap transformation I assume.

Mouvedia commented 1 year ago

Do we have a flamegraph of node_modules/.bin/jest --runInBand ? In short we need some metrics/profiling first.

romainmenke commented 1 year ago

https://github.com/stylelint/stylelint/blob/main/lib/rules/color-named/index.js#L63-L128

color-named is a good example of a performance issue.

It eagerly parses declaration values with postcss-value-parser without a fast abort.

It then walks the value AST and again eagerly parses with colord.

We also have a color value parser built on top of our tokenizer and parser algorithms : https://github.com/csstools/postcss-plugins/tree/main/packages/css-color-parser#readme

The input to this specialized parser is not a string but component values. So there isn't any expensive serializing and re-parsing to make tools work together.

As much logic as possible can be done first at the token level, then on component values, and only when really needed on fully parsed color values.

Each step only does the minimal amount of work.
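A sketch of that staging with a toy tokenizer. The three-color list, the regular expressions and `findNamedColors` are simplifications invented for this example; they are neither the color-named rule nor the @csstools parsers.

```js
const NAMED = ['red', 'blue', 'rebeccapurple']; // tiny subset for the example

// Stage 0: one cheap regular expression over the raw string.
const cheapTest = new RegExp(`\\b(?:${NAMED.join('|')})\\b`, 'i');

// Stage 1: toy tokenizer. url(...) is one token, then identifiers, then
// any other run of non-letter, non-space characters.
const TOKEN = /url\([^)]*\)|[a-z][a-z0-9-]*|[^a-z\s]+/gi;

const stats = { skipped: 0, tokenized: 0 };

function findNamedColors(value) {
	if (!cheapTest.test(value)) {
		stats.skipped += 1; // fast abort: nothing is tokenized
		return [];
	}
	stats.tokenized += 1;
	const tokens = value.match(TOKEN) ?? [];
	// Only whole identifier tokens count, so "red" inside url() is not reported.
	return tokens.filter((token) => NAMED.includes(token.toLowerCase()));
}

const found = ['#ff0000', 'url(red.png)', '1px solid Red', 'darkred'].map(findNamedColors);

console.log(found, stats);
```

A full color parser would only be needed as a third stage, for the few tokens that survive the first two.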

alexander-akait commented 1 year ago

@romainmenke

The tokenizer is not something we have to start all over right? Is there a reason we can not use our existing tokenizer?

https://github.com/csstools/postcss-plugins/tree/main/packages/css-tokenizer#readme

I am fine with it.

My suggestions are:

move it to its own repo; we can still be under the csstools org, just to avoid mixing postcss-plugins work and parser work
maybe we can move all parser-related things there?
I looked at the code and it looks like a fully CSS-compliant tokenizer
Also, I found some possible memory and perf improvements (for example, we store each character - https://github.com/csstools/postcss-plugins/blob/main/packages/css-tokenizer/src/tokenizer.ts#L92, which means that if we have 3 MB of CSS, we will store the tokens plus 3 MB of characters, which is not good)
if we want to support Sass/Less/any custom syntax, we need to make it extensible. It is not hard: we should just allow overriding the token logic here https://github.com/csstools/postcss-plugins/blob/main/packages/css-tokenizer/src/tokenizer.ts#L81 and running a custom function, so this section needs to be refactored
Implement callbacks on each token (useful for bundlers): to bundle CSS you don't need the full AST (in most cases), and to avoid running loops twice we can implement callback options
because it is TypeScript, we need to verify the output, because TypeScript generates additional code, which can noticeably affect performance in some cases
comments in the source code with links to the CSS spec and descriptions would be useful; this is not necessary, but developers usually do it for convenience (look at acorn/typescript/etc. for examples)
we need to move it into one file, because each import/require degrades start time (this is pretty obvious for a parser: for each file Node.js executes fs calls, and they cost time)
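Two of these suggestions, the override hook and the per-token callback, can be sketched with a toy tokenizer. Everything below (`tokenize`, `consumeCustom`, `onToken`, the token names) is hypothetical and much simpler than the spec algorithm; it is not the @csstools/css-tokenizer API. The toy also indexes into the string on demand instead of first copying every character into an array.

```js
// Toy tokenizer with two hypothetical options:
// - consumeCustom(css, pos) may return [type, end] to claim a token first
// - onToken(token) is called for every token as it is produced
function tokenize(css, { onToken, consumeCustom } = {}) {
	const tokens = [];
	let pos = 0;

	const isWord = (i) => /[\w-]/.test(css[i] ?? '');
	const emit = (type, start, end) => {
		const token = [type, css.slice(start, end), start, end];
		tokens.push(token);
		if (onToken) onToken(token);
	};

	while (pos < css.length) {
		// Extension point for Sass/Less/custom syntaxes.
		const custom = consumeCustom ? consumeCustom(css, pos) : null;
		if (custom) {
			emit(custom[0], pos, custom[1]);
			pos = custom[1];
			continue;
		}

		const start = pos;
		if (isWord(pos)) {
			while (isWord(pos)) pos += 1;
			emit('word', start, pos);
		} else if (/\s/.test(css[pos])) {
			while (pos < css.length && /\s/.test(css[pos])) pos += 1;
			emit('whitespace', start, pos);
		} else {
			pos += 1;
			emit('delim', start, pos);
		}
	}
	return tokens;
}

// Custom hook: treat "$name" as a single Sass-style variable token.
const sassVariable = (css, pos) => {
	if (css[pos] !== '$') return null;
	let end = pos + 1;
	while (end < css.length && /[\w-]/.test(css[end])) end += 1;
	return end > pos + 1 ? ['sass-variable', end] : null;
};

const seen = [];
const tokens = tokenize('color: $brand', {
	consumeCustom: sassVariable,
	onToken: (token) => seen.push(token[0]),
});

console.log(tokens);
```

A bundler that only cares about certain tokens could do all of its work inside `onToken` and ignore the returned array, which avoids a second loop over the tokens.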

Maybe I missed something else but this is not a problem, we can discuss it in the repository if we can all agree

romainmenke commented 1 year ago

move it to its own repo; we can still be under the csstools org, just to avoid mixing postcss-plugins work and parser work

I don't have ownership, admin or publish permissions for either the github org or the npm org for csstools. Either that needs to change and must be extended at least to you (@alexander-akait) or a different space must be created for this effort.

It might be better to do a clean slate start. (We can transfer existing code, test suites, ...)

I personally prefer to work in a mono repo because that makes it easier to spot regressions. Are you ok with having a single git repository for all tokenizer, parser related work?

I agree on all points of feedback related to the current tokenizer.

ybiquitous commented 1 year ago

@alexander-akait @romainmenke If you wish, providing repositories for parsers etc. under the github.com/stylelint org may be possible.

@stylelint/owners Any thoughts?

ntwb commented 1 year ago

If you wish, providing repositories for parsers etc. under the github.com/stylelint org may be possible.

No objections to hosting under the github.com/stylelint org

Would servo/rust-cssparser be suitable to integrate?

It's interesting. But I believe it may be hard for our community to maintain Rust code.

This is something to be aware of. Historically, Stylelint has had difficulty attracting contributors at various times. It has at times been quite challenging both to allow Stylelint to be extended by other plugins and to have Stylelint depend on other packages while keeping this ecosystem maintained.

Another consideration is the https://github.com/eslint/rfcs/pull/99

This RFC specifies a plugin format that would allow ESLint plugins to fully define their own languages, effectively expanding ESLint from a JavaScript-focused linter into a more general-purpose linter.

The goal here is to take the boring parts of a linter (file finding, configuration, etc.) and separate that out from the JS-specific parts so no one needs to rebuild the boring parts over and over again.

I've not fully thought through all of this, though if we are writing a new tokenizer/parser, having ESLint under the hood to simplify and streamline the maintenance of the underlying CLI and API aspects of Stylelint is also worth thinking about, IMHO

ybiquitous commented 1 year ago

@ntwb Thanks for the comment. As you mentioned, Stylelint has needed more maintainers.

I personally think @alexander-akait's suggestion is great not only for the Stylelint community but also for other JS/CSS communities. However, unfortunately, supporting the challenge under the Stylelint organization may be risky because of that maintainer shortage. 😓

alexander-akait commented 1 year ago

@romainmenke

I personally prefer to work in a mono repo because that makes it easier to spot regressions. Are you ok with having a single git repository for all tokenizer, parser related work?

Yes, of course, tokenizer/parser/traverser/serializer, these are things related to the parser process, so it would be great to have them all in one place.

@ntwb

Another consideration is the https://github.com/eslint/rfcs/pull/99

I've not fully thought through all of this, though if we are writing a new tokenizer/parser, having ESLint under the hood to simplify and streamline the maintenance of the underlying CLI and API aspects of Stylelint is also worth thinking about, IMHO

It's so funny, because I offered to do this 5 years ago, when we were just starting work, but I was refused everywhere; now it's official.

And I started from a simple idea: we should make a core for any linter. CLI logic, rules logic, configuration(s), ignoring and extending, options for parsers and rules, fixable logic, etc.: we had to duplicate all of this. My logic was that we could avoid this, collaborate and combine the work, and now I see how it has all come to this. But unfortunately it is a little late: our code has become more complicated, and now it would be quite difficult to rewrite all this (yeah, we can just create a rule and run stylelint inside that rule, but that looks like one big monolithic and badly configurable rule).

But now we can avoid some mistakes too

JS has https://github.com/estree/estree, so any parsers that follow estree are compatible, and I think we have to do the same. Yes, it takes time and I definitely can't do it alone, BUT if we do this, we will become independent of the parser and its implementation in the future: Rust/JS/Zig/C++/C, whatever you want. I still think that the idea of rewriting everything in Rust is a utopia at this moment (the future is foggy and we do not know what will happen tomorrow, but we can influence it). Yes, it would be great and it would give us good perf and much more, but if we look at the world realistically, we will understand that, unfortunately, there are not so many people who know it, and most of our users know only JS (some TS too). But this does not mean that we should not build the right foundation. If we get to this in time, it will be fine, but for now we can just agree on some documents for AST structures and maybe a basic API.
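As a sketch of what agreeing on AST structures buys: if every parser emits plain nodes with a `type` field, as ESTree does for JavaScript, generic tooling can be written once. The node and field names below are invented for illustration and are not a proposed spec.

```js
// Assumption: a shared spec fixes only the node shape. This tree could come
// from any parser that follows it.
const treeFromAnyParser = {
	type: 'StyleSheet',
	children: [
		{
			type: 'Rule',
			prelude: {
				type: 'SelectorList',
				children: [{ type: 'TypeSelector', name: 'a' }],
			},
			children: [
				{
					type: 'Declaration',
					property: 'color',
					children: [{ type: 'Ident', name: 'red' }],
				},
			],
		},
	],
};

// A walker that relies only on `type` and nested nodes, so it does not care
// which parser or implementation language produced the tree.
function walk(node, visit) {
	visit(node);
	for (const value of Object.values(node)) {
		const children = Array.isArray(value) ? value : [value];
		for (const child of children) {
			if (child && typeof child === 'object' && typeof child.type === 'string') {
				walk(child, visit);
			}
		}
	}
}

const types = [];
walk(treeFromAnyParser, (node) => types.push(node.type));

console.log(types.join(' > '));
```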

conartist6 commented 1 year ago

Hey I just want to introduce myself. I'm working on a shared parser/linter/formatter core, and it is my explicit goal (and full-time job) to unify what can be unified across this ecosystem. I believe myself to be several (important) steps ahead of ESLint in this regard, and as they have also shown me nothing but indifference it seems that I am their open competitor. My project is still flying under the radar for the moment, but I plan for that to change in a major way, and soon.

romainmenke commented 1 year ago

Might be an interesting read : https://railsatscale.com//2023-06-12-rewriting-the-ruby-parser/

ybiquitous commented 1 year ago

Thanks for sharing the article. I read it. We wish for a "Universal Parser" for CSS, too!

silverwind commented 1 year ago

The best CSS parser ought to be the one that browsers use. I wonder if Blink's CSS parser could be leveraged 😆.

romainmenke commented 1 year ago

The best CSS parser ought to be the one that browsers use.

Yes and no :)

They are the best because they are extremely well tested and are used in the wild by billions.

But browsers only need to parse CSS for a limited use case. Their parsers don't have to preserve as much debug info (like whitespace or comments).

Those parsers also don't have to support non-standard syntax like scss, less, ....

LightningCSS, for example, uses Servo's CSS tokenizer/parser, and that is what makes it good and extremely fast. But it's also the source of all the limitations of LightningCSS.

LightningCSS cannot be used to build a linter because it discards too many tokens.
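
To make the point about discarded tokens concrete, here is a minimal sketch of a tokenizer that keeps whitespace and comments as tokens, so the source can be rebuilt exactly. This is an assumption-laden simplification: `tokenize` and its token types are made up for illustration, and it is not a spec-compliant CSS Syntax tokenizer.

```javascript
// Hypothetical, simplified tokenizer (NOT spec-compliant).
// Alternatives, in order: comment, whitespace, "word" run, any single character.
const TOKEN = /(\/\*[\s\S]*?\*\/)|(\s+)|([^\s{}:;,()\/]+)|([\s\S])/g;

function tokenize(source) {
  const tokens = [];
  for (const match of source.matchAll(TOKEN)) {
    const type =
      match[1] !== undefined ? "comment" :
      match[2] !== undefined ? "whitespace" :
      match[3] !== undefined ? "word" : "delim";
    tokens.push({ type, text: match[0] });
  }
  return tokens;
}

const css = "a { color : red; /* keep me */ }\n";
const tokens = tokenize(css);

// Lossless: concatenating every token reproduces the input exactly.
const rebuilt = tokens.map((token) => token.text).join("");
console.log(rebuilt === css);

// What a parser that drops whitespace and comments would leave behind.
const lossy = tokens
  .filter((token) => token.type !== "whitespace" && token.type !== "comment")
  .map((token) => token.text)
  .join("");
console.log(lossy);
```

With the lossy stream, a rule such as "no space before the colon" or "comments must be preceded by an empty line" has nothing left to inspect, and a fixer cannot write the file back without reformatting it.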

conartist6 commented 1 year ago

LightningCSS cannot be used to build a linter because it discards too many tokens.

This is where I come in! cst-tokens takes the output of an existing parser and uses it to rebuild a tree in which every source character is present in the token stream. Doing this requires defining the syntax of CSS in a cst-tokens parser grammar, but the parser need not be complete: it does not need to know how to resolve ambiguity. The traversal code simply uses the output of the first-pass parser for that purpose. In this way my project's functionality is closely related to that of ungrammar (which you should also look into though I am focused on extensible grammars and they are not).

The cst-tokens CST is also a pure superset of the AST it decorates, and is meant to have all the APIs needed to build any kind of parser, formatter, and linter functionality. It allows comment attachment rules for ambiguous comments to be well-defined, while always preserving the ability to see all possible comment attachments for any given node.

conartist6 commented 1 year ago

Another reason there's a strong case for a concrete syntax wrapper around an existing AST is that you don't really have to risk breaking anything!! You use the same parser -- you're just adding a new validator and retokenizer layer, so for your users AND your lint rules the language is guaranteed not to have changed at all!

The downside is that the technology isn't ready for production usage yet, and won't be for a little while. Serious users will want to see the library hit 1.0.0, a goal which I've ensured that I can reach and am working directly towards.

I'm essentially here asking for help doing the work that makes everything I am describing possible. With the right help I could get to 1.0.0 a lot faster!

romainmenke commented 1 year ago

I think it's important to find a place for this effort so that we can split this thread.

I don't want to engage too much on specifics but I also don't want to appear dismissive of people reaching out like @conartist6 .

I think many people care about this issue and want to collaborate.

Maybe any new repository is fine? It only needs to serve as a temporary home for discussion and issues.

A place where we can align on priorities, goals, ...

ybiquitous commented 1 year ago

I can provide a new repository in the github.com/stylelint org, which would be a temporary home for our collaboration. It would also work until we find a more appropriate home (org).

For example, how about github.com/stylelint/css-parser? I can invite a few people as the repository owner at first.

scripthunter7 commented 1 year ago

I'm also interested in this project, and as my time allows, I am happy to help with the planning / implementation. Are you planning to create a Discord server or similar communication platform?

alexander-akait commented 1 year ago

I like the idea of a CST, but unfortunately generic solutions often perform much worse due to overhead (though I would look at the benchmarks). The original CSS tokens (from the syntax spec) already have everything: whitespace, tokens, etc. It is also good to stay aligned with the spec for maintenance purposes.

If someone wants to start, that would be great; I'm a little busy right now. And yes, either way we need to start with the tokenizer, and we already have solutions (we can reuse them).

ybiquitous commented 1 year ago

@alexander-akait @romainmenke I've created a repository for this project and invited you as an admin. https://github.com/stylelint/css-parser

Please freely use it. Since the repo may be temporary, you don't need to follow the Stylelint organization rules.

Are you planning to create a Discord server or similar communication platform?

@scripthunter7 We have no plan at this point, but it's possible to consider it if such a platform is required. I want to leave that decision up to the admins.

romainmenke commented 1 year ago

Thank you @ybiquitous,

I will try to get the ball rolling in a few issues in the next few weeks.

silverwind commented 1 year ago

I recently saw @keithamus's csslex, maybe it is something to consider using.

romainmenke commented 1 year ago

Thank you for sharing this @silverwind That package looks really great!

I've started a list of tokenizers here : https://github.com/stylelint/css-parser/issues/1

ybiquitous commented 1 year ago

@romainmenke You can transfer this issue to stylelint/css-parser if you wish. Of course, leaving it as-is is no problem either. 👍🏼

conartist6 commented 1 year ago

I'm still working on my solution. It won't be fast in the way Rust is zoom-zoom close-to-the-metal fast, but it will be incremental, streaming, extensible, and easy to maintain -- properties that should prove highly advantageous to linters. Right now I'm working on defining an XML-based serialization format that allows my disambiguated trees to be easily sent over a wire. It's a fun example to check out because it both defines the syntax and shows how the parser core works to define syntaxes. https://gist.github.com/conartist6/5adbbf28d11497467848f530756c1c2a

conartist6 commented 1 year ago

As for the zoom-zoom part, making that method of defining syntax fast is mostly just a matter of doing some code transformation. For example if you have a production like this:

export const productions = {
  *Identifier() {
    yield eat(tok`Identifier`);
  }
}

There's a bunch of associated cost from evaluating eat(tok`Identifier`) repeatedly. But I could eliminate that cost using a hoisting transform that would change the code to something like

const hoisted_1 = eat(tok`Identifier`);
export const productions = {
  *Identifier() {
    yield hoisted_1;
  }
}

Now you can see that there's actually a pretty small amount of logic necessary to process any given production!
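
A runnable sketch of that effect follows. The `tok` and `eat` functions here are hypothetical stand-ins written only for this example (the real cst-tokens helpers may look quite different); the counter just makes the per-call cost visible.

```javascript
// Hypothetical stand-ins for `tok` and `eat`, for illustration only.
let allocations = 0;
const tok = (strings) => ({ kind: "token", type: strings[0] });
const eat = (matcher) => {
  allocations += 1; // count how often an instruction object is built
  return { verb: "eat", matcher };
};

// Un-hoisted: the instruction is rebuilt every time the production runs.
const productions = {
  *Identifier() {
    yield eat(tok`Identifier`);
  },
};

// Hoisted: the instruction is built once, when this code is first evaluated.
const hoisted_1 = eat(tok`Identifier`);
const hoistedProductions = {
  *Identifier() {
    yield hoisted_1;
  },
};

// Runs the Identifier production `times` times and reports how many
// instruction objects were built while doing so.
function run(grammar, times) {
  allocations = 0;
  for (let i = 0; i < times; i++) {
    for (const instruction of grammar.Identifier()) {
      void instruction; // a real engine would act on the instruction here
    }
  }
  return allocations;
}

const unhoistedCount = run(productions, 1000);
const hoistedCount = run(hoistedProductions, 1000);
console.log({ unhoistedCount, hoistedCount });
```

Hoisting like this is only safe if the engine treats the yielded instruction as immutable, since every run of the production now shares the same object.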

conartist6 commented 1 year ago

What you gain for your effort is the ability to process chunked streams. You don't need to have the entire source in a single string, as many parsers require so that they can store indexes into the string as state.

For a linter this means gaining the ability to lint files larger than fit in memory. Memory usage would be driven more by the complexity of language and query rules than by the size of the file being linted.
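
As a rough illustration of the chunked idea (not how my parser core is actually implemented), here is a hypothetical generator, `tokenizeChunks`, that reads an iterable of string chunks and only ever buffers the token currently being read. The word/whitespace split is a deliberate oversimplification of real tokenization.

```javascript
// Hypothetical sketch: yields { type, text } tokens from an iterable of chunks.
// Only the unfinished token is held in memory, never the whole source.
function* tokenizeChunks(chunks) {
  let pending = ""; // text of the token still being read
  let pendingIsSpace = false;
  for (const chunk of chunks) {
    for (const ch of chunk) {
      const isSpace = /\s/.test(ch);
      // A change between whitespace and non-whitespace ends the current token.
      if (pending !== "" && isSpace !== pendingIsSpace) {
        yield { type: pendingIsSpace ? "whitespace" : "word", text: pending };
        pending = "";
      }
      pending += ch;
      pendingIsSpace = isSpace;
    }
  }
  // Flush whatever is left once the input ends.
  if (pending !== "") {
    yield { type: pendingIsSpace ? "whitespace" : "word", text: pending };
  }
}

// The same source, delivered whole and delivered in arbitrary chunks.
const whole = [...tokenizeChunks(["color: red;"])];
const chunked = [...tokenizeChunks(["col", "or: r", "ed;"])];
console.log(JSON.stringify(whole) === JSON.stringify(chunked));
```

Tokens carry their own text rather than indexes into a source string, so a token that straddles a chunk boundary comes out the same as one that does not.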

conartist6 commented 1 year ago

Also, tokens that index into strings tend to perform badly when you want to insert a token. The insertion invalidates every token after it, because all of their indexes need to be shifted by some offset.
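
A small sketch of the difference, using made-up token shapes and helper names (`insertOffsetToken`, `insertTextToken`) that exist only for this example:

```javascript
// Offset-based tokens: each token stores indexes into the source string.
let source = "a{color:red}";
const offsetTokens = [
  { start: 0, end: 1 },   // a
  { start: 1, end: 2 },   // {
  { start: 2, end: 7 },   // color
  { start: 7, end: 8 },   // :
  { start: 8, end: 11 },  // red
  { start: 11, end: 12 }, // }
];

// Inserting means rewriting the source and shifting every later token.
// Assumes tokens are contiguous and sorted by position.
function insertOffsetToken(tokens, index, text) {
  const start = index > 0 ? tokens[index - 1].end : 0;
  let shifted = 0;
  for (let i = index; i < tokens.length; i++) {
    tokens[i].start += text.length;
    tokens[i].end += text.length;
    shifted += 1;
  }
  tokens.splice(index, 0, { start, end: start + text.length });
  source = source.slice(0, start) + text + source.slice(start);
  return shifted; // how many existing tokens had to be modified
}

// Text-carrying tokens: each token owns its text, so the others stay untouched.
// (The array itself still shifts its slots; a linked list would avoid that too.)
const textTokens = ["a", "{", "color", ":", "red", "}"].map((text) => ({ text }));

function insertTextToken(tokens, index, text) {
  tokens.splice(index, 0, { text });
  return 0; // no existing token object was modified
}

// Insert a space after the colon in both representations.
const shiftedOffsets = insertOffsetToken(offsetTokens, 4, " ");
const shiftedTexts = insertTextToken(textTokens, 4, " ");

const fromOffsets = offsetTokens.map((t) => source.slice(t.start, t.end)).join("");
const fromTexts = textTokens.map((t) => t.text).join("");
console.log({ shiftedOffsets, shiftedTexts, fromOffsets, fromTexts });
```

In this tiny input only two tokens follow the insertion point, but in a large file an insertion near the top would touch nearly every token.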

Mouvedia commented 10 months ago

related: biomejs/biome#268
