tree-sitter-grammars / tree-sitter-yuck

Eww grammar for tree-sitter
MIT License
3 stars 2 forks source link

feat: complete queries, make overall project look pretty #1 #1

Closed amaanq closed 1 year ago

amaanq commented 1 year ago

Same as last time haha, I was a bit lazier this time as fixing/debugging the grammar took a bit longer. I followed convention with regards to modifying the structure of precedences, iirc tree-sitter-python and c or c++ look similar to that.

Let me know of anything to edit/fix! :)

Also, same plan as last time (I hope?), I'd like to be added as a contributor again... Yuck was on my TODO list but other languages and school have distracted me. Thanks for taking the time to wriite this!

Philipp-M commented 1 year ago

Thanks for all the work you've put into this!

Well it would be a little bit easier to review, when this would've been split up across different PRs (especially the changes in the parser)... Also to avoid work for both of us.

Anyway...

I think the parser before this PR was pretty much complete, maybe needed some code beautification, but should completely describe the AST of any yuck file, well if no bugs were in there, but I tested parsing on quite a few yuck files and it produced an AST that is close to the AST of the main repo.

In general I would prefer to stay close to the parsers in the main eww repo (e.g. yuck has an embedded DSL called "simplexpr") , mainly:

This is to easily adjust to changes in the syntax and to maintain an AST that is close to the one used in the main repo. That is why e.g. symbol or simplexpr is used as node names. E.g. in (for entry in my-json ...) for is not really a widget, symbol makes probably more sense here (which is used in the main repo as well).

To be able to highlight specific nodes differently (i.e. the include node you have specified) I think it may be better to match these via e.g. (#match? @constant "^include$") etc. and maintain an AST that describes the language well (well it's a syntax tree after all)

I've seen that you changed quite a few regexes (e.g. identifier), as previously stated, I think it's better to keep the current regexes, because they are mostly copied from the original parser/lexer (e.g. for the symbol https://github.com/elkowar/eww/blob/6a64a286291456984011c4fb160def106ee55dff/crates/yuck/src/parser/lexer.rs#L68), so they closely represent what the language is really be able to do.

amaanq commented 1 year ago

I've updated parts of the grammar to remove the include rule and use regex matches to highlight def keywords and include keywords, as well as restoring symbol and its original regex. Anything more to restore? I agree with keeping it in-line as much as possible with the DSL, but tree-sitter grammars usually aren't a 1-to-1 mapping (and for some languages that's not possible :))

amaanq commented 1 year ago

The only issue now is in this screenshot:

image

"used_mem_perc" cannot be highlighted as a field/property, because all of EWW_RAM.used_mem_perc is matched as a symbol. I'm not sure if we can use a regex to highlight it, but that'd be nice imo :) unless you think it's not meant to be a field.

Philipp-M commented 1 year ago

Well, the parser is a little bit better, but overall (what I wanted to say in my previous comment) is that the parser was already done, and represents the AST already well. So IMHO the only thing that might be subject to change in the parser/grammar (apart of course of new language features added in the main repo sometime in the future) is probably a little cleanup (like prec naming, and maybe adjusting prec (I for whatever reason started with 10 and counted down to 4, so maybe subtract 3, so that the highest precision is 7). But the overall logic of the parser (including most names of the nodes) should IMHO stay. Right now, I think only the precisions in the simplexpr should probably be named and grouped together (something like PREC.json_access)

Sorry to say that, but right now I see a lot of issues in your parser, e.g. there's no float in the original parser (only one number node), there are a few operators not supported in yuck (e.g. bitwise operators), "simplexpr" is not correctly represented wherever it's used (i.e. in string interpolation, or as you noted, no "simplexpr" in your block node, see below) etc.

"used_mem_perc" cannot be highlighted as a field/property, because all of EWW_RAM.used_mem_perc is matched as a symbol.

Can you please try the original parser? This should've already worked there (it's an index node there, though the name could probably be changed to something different...)

I really recommend to read the lalrpop parser which I've linked. Looking at the real parser is IMHO generally a good way to start a tree-sitter parser. You mostly just have to "translate" the code, you don't have to guess that much, and you're learning a lot while doing it, e.g. the lalrpop parser, or (simple) recursive descent parsing in the case of ungrammar etc. I think it also will likely represent the real grammar of the language more closely.

The grammar should really just describe the AST of the real language, no special handling of nodes etc. (if it's not in the real language), only if it really makes sense (e.g. a normal regex in the queries is not enough, but I think this happens very rarely). One specific example in the case of the yuck parser I wrote is, that the original parser parses also negative numbers (and + signs): Compare the original regex(of the real parser): [+-]?(?:[0-9]+[.])?[0-9]+ with (?:[0-9]+[.])?[0-9]+ of my parser, I have instead included + as unary operator to be able to highlight it correctly, and be able to represent it in the (tree-sitter) AST better.

I think the most effective changes should be done in the highlights queries instead (and generally other queries, I have not really invested much time there as you probably noticed). E.g. if you want to differ syntax highlighting between floats and integers (whether that makes sense or not), you should probably check, if the number has a dot in it (with a match and regex probably), or if you want to highlight some symbols differently (e.g. all the builtins, like defwidget or box or for etc.) like this: https://github.com/cstrahan/tree-sitter-nix/blob/1b69cf1fa92366eefbe6863c184e5d2ece5f187d/queries/highlights.scm#L21

Sorry if that came a bit harsh, that's not my intention, apart from the parser most is fine at first look, and I appreciate all the other changes.

amaanq commented 1 year ago

Sorry I got caught up...I more or less kept it as the original parser. Some notes:

I don't really like calling ast $._ast, because it defines a scope and should be captured as so. I think $.ast also gives no hint or clue to a user looking at a tree what it actually is, so naming it $.ast_block is better imo

I am very against clumping floats and integers together, even if it is in the DSL. This might seem like a minor nitpick but using regex-based queries to distinguish them is unnecessarily costly. (I like your idea of putting +- under unary ops to highlight them properly)

I renamed property to keyword and primary_expression to simplexpr, but I think the expression block I added should stay, it adds clarity and putting it as a supertype ensures it doesn't show up in the CST.

A thing to note is, tree-sitter generates CSTs not ASTs. In this context it's trivial since so does the lalrpop parsers practically do too. But it's a distinction to keep in mind going forward :)

I do like adhering to the original parser as much as reasonably possible, but exceptions are ok when warranted, and naming doesn't really have to be exactly same (think ast -> ast-block).

Everything else seems ok to me now, thanks for bearing with me! I never even looked at the lalrpop files when I first made these changes, so thanks for linking those

amaanq commented 1 year ago

Wow, really sorry about the binops! Completely forgot to remove them...I copied over the binary_expression from python or c, it's always my go-to so I'm sorry about that, removed now

amaanq commented 1 year ago

I've sort-of redone it based on the original parser now, sorry again for misunderstanding your intentions at first.

I like how I handle escape sequences, because "escapes" such as '\A' are not really an escape sequence, and thus won't be processed as one, but '\n', '\a', etc will be. This is an important distinction in editors and, more or less, in general

Philipp-M commented 1 year ago

I like how I handle escape sequences, because "escapes" such as '\A' are not really an escape sequence, and thus won't be processed as one, but '\n', '\a', etc will be. This is an important distinction in editors and, more or less, in general

Well I agree that they might be highlighted differently (like you said only escape sequences that make sense, the rest just as string literal), but IMHO they should be represented correctly in the tree still (I think your parser fails with valid escape sequences of the language, which would be \A as example. As said, I'd keep the escape_sequence so that valid yuck can be parsed correctly, but add different highlighting for the right ones (e.g. via a match query).

Can you please check each of my comments (they took some time to write/and review)?

At a quick look a lot of my comments were not yet considered yet in the new version...

amaanq commented 1 year ago

Sorry about that, I didn't notice the "8 hidden conversations" button showing a lot more of your comments, let me get on that now..

As for escape sequences, I don't think \A should even be parsed as an escape sequence to begin with...as it's really not an escape sequence. AFAIK some grammars distinguish them, some don't, so it's more about "stick to the DSL" or "make editor highlighting easier". Using regex is costly and it'd just be easier to throw all escape_sequences under @string.escape. I'll let you make the final call on that, just thought that I should say my two cents about it.

amaanq commented 1 year ago

Ok, addressed everything but escape_sequences now, that's your final call

Philipp-M commented 1 year ago

so it's more about "stick to the DSL" or "make editor highlighting easier"

Well it turns out it's neither, I've checked the original parser in more detail again, and it seems escaped nodes are just ignored and string literal fragments are combined together. I've implemented that behavior now, with your logic for escape_sequences. I have also fixed the number node, which comment you seem to have overlooked (the behavior was a little bit different anyway (yuck only understands either 12.0 as float or 12 as integer, nothing else (missing leading zeros or no decimal places etc.))).

I have also provided a (eslint) config for prettier and reformatted the project (as prettier is kind of a standard). For the future, a tip: please don't reformat code at the same time when features are included, use a reformat PR or something like that. And in general only reformat to a different style, if the maintainers agree, reformatting can possibly create a lot of merge conflicts with different PRs (not the case here but anyway...).

The parser is now ok IMHO. I have not yet tested/reviewed the highlight queries yet (or the other things added here)

amaanq commented 1 year ago

a tip: please don't reformat code at the same time when features are included, use a reformat PR or something like that

Yes, this is a habit I need to break but having your editor automatically do it makes it a bit of a nuisance...I'll work on that :)

Let me test your changes real quick..

Big fan of that string refactor, thanks!

Philipp-M commented 1 year ago

Big fan of that string refactor, thanks!

Ok good to hear :)

I've also slightly reordered the rules, so that they read more linearly

amaanq commented 1 year ago

FYI, that reordering of node rules is unconventional, lexical precedence (which should be documented better imo), takes place based on the order of the rules, which is why in almost every grammar you have comments at the bottom and idents & literals near the bottom as well. I think for this case it's ok, but it's something to be aware of

Tests are happy so this LGTM now as well (well by test I mean the one example in the eww repo)

amaanq commented 1 year ago

Example screenshot with highlights.. image

Philipp-M commented 1 year ago

Looks good so far.

I thing, I just noticed on the highlights on your screenshot (and also a thing I've wondered already when looking at the queries) is, that strings are completely highlighted as @string. I think it should only be all the quotes/backticks and string_lit_fragment shouldn't it?

(see the string interpolation ${music} )

amaanq commented 1 year ago

Oh yes, I think that's an issue with my theme that I never bothered to look into much

See the precedence captures inspected at the music block:

image

(In the nvim-treesitter PR @embedded should be removed)

amaanq commented 1 year ago

Added a fuzz (for the scanner) and lint action

amaanq commented 1 year ago

One nitpick, can we rename string_lit_fragment to string_fragment?

amaanq commented 1 year ago

This is good to me now (besides the nitpick)

Philipp-M commented 1 year ago

One nitpick, can we rename string_lit_fragment to string_fragment?

Yeah I think that's fine as well, I've renamed it.

I think it should only be all the quotes/backticks and string_lit_fragment

I've also implemented that, as I think explicit matching is better in this case.

I did a few slight adjustments in the highlight queries, let me know if that is fine for you. Neovim seems to be quite different in giving labels compared to helix unfortunately, I'll stay with neovim style here I guess.

I changed symbol to the label @tag as I think it's more appropriate than @type

Philipp-M commented 1 year ago

Otherwise, I think this is ready to merge (if you're fine with the changes I've committed).

I'll just trust you with the other queries for now, I haven't done much tree-sitter querying apart from syntax-highlighting.

I can add you as collaborator, but please, if you change something in the parser, either wait for my review (I guess that's a good idea anyway), or double-check if the lalrpop parser parses it exactly like the tree-sitter parser.

Philipp-M commented 1 year ago

Ok one last small thing I've found in the hightlight queries (matching priority, and a useless/wrong(?) rule for idents)

amaanq commented 1 year ago

I can add you as collaborator, but please, if you change something in the parser, either wait for my review (I guess that's a good idea anyway), or double-check if the lalrpop parser parses it exactly like the tree-sitter parser.

Yeah of course. Any changes to the grammar (breaking or not) I'd submit a PR and consult with you first. The only time I'd merge without consulting is simple stuff like typos, updating a ci/readme, etc.

amaanq commented 1 year ago

Neovim seems to be quite different in giving labels compared to helix unfortunately, I'll stay with neovim style here I guess.

This is something we want to collaborate with helix (and maybe emacs) on because it'd benefit everyone. Just need to get the right push...

amaanq commented 1 year ago

I changed symbol to the label @tag as I think it's more appropriate than @type

Completely forgot about that since practically only html uses it, I agree as well

amaanq commented 1 year ago

Check latest commit, I want tags first so any subsequent queries are applied after.

(It might be that helix does this the opposite way...)

image

Without that commit, any def* and include keywords would be highlighted purple too

But thank you for the edit! Now string interpolations are parsed a bit better for me :) (well I think the issue is the @variable for this theme is just pure white or not highlighted at all, thus leading to string (green) being applied anyways in the prior screenshot)

Philipp-M commented 1 year ago

It might be that helix does this the opposite way...

Kind of I guess... and it's probably one more reason why this:

This is something we want to collaborate with helix (and maybe emacs) on because it'd benefit everyone. Just need to get the right push...

is even more important. I think helix stops at the first match (haven't looked yet at that part in helix source though, so not sure, but it feels like it does, also likely more efficient than neovim btw.), but that introduces even more incompatibilities unfortunately.

I've added one last thing (explicit keyword matching, I've quickly scanned all the keywords that are available in the eww source), I'm gonna merge that now...

amaanq commented 1 year ago

I'll likely start a discussion in Helix soon and see what we can get done, I really want to get that collaboration/query structure unification solidified...

I've added one last thing (explicit keyword matching, I've quickly scanned all the keywords that are available in the eww source), I'm gonna merge that now...

If that's the case, #any-of? would've been better, sorry for not looking for all matches before

Philipp-M commented 1 year ago

If that's the case, #any-of? would've been better, sorry for not looking for all matches before

You should be able to commit on main now, maybe change that?

amaanq commented 1 year ago

Can I just amend to your last commit?

Philipp-M commented 1 year ago

sure

amaanq commented 1 year ago

One last note for ungrammar and yuck, can you give me owner access briefly so I can add the publish.yml action? I'd just add my cargo and npm tokens in the repo secrets so that any tagged commits from 0.0.1 onwards will be auto published to both in case the grammars for either change.

Philipp-M commented 1 year ago

One last note for ungrammar and yuck, can you give me owner access

Well that's the issue (as I tried to explain in the issue in ungrammar), I would do that but it's not possible to do without an organization AFAIK.

Philipp-M commented 1 year ago

See here: https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-personal-account-settings/permission-levels-for-a-personal-account-repository

amaanq commented 1 year ago

Well that's the issue (as I tried to explain in the issue in ungrammar), I would do that but it's not possible to do without an organization AFAIK.

Aha..I see. Do you have a platform I can dm you on then? Sorry for missing your response in the ungrammar repo

Email works too

Philipp-M commented 1 year ago

Sure my email should be in each of the commits ;)

amaanq commented 1 year ago

Just emailed

Philipp-M commented 1 year ago

Just emailed

and already set up the secrets, your workflows should work now with the given secrets

Philipp-M commented 1 year ago

Btw. do you think I can update the nvim-treesitter PR with the current state or is there still something you want to commit?

amaanq commented 1 year ago

Nope it's good now, go for it! Just remove @embedded