jeff-hykin / better-shell-syntax

💾 📦 ♻️ An improvement to the shell syntax for VS Code
MIT License
52 stars 4 forks source link

Nested zsh style Parameter Expansion flags throws parser #88

Open wdeshazer opened 6 months ago

wdeshazer commented 6 months ago

The code with a problem is:

deserialize_pathOpts_from_file() {
    typeset -grA _pathOpts
    _pathOpts=( "${(Q@)${(z@)"$(<pathOpts_path)"}" )
}

serialize_pathOpts_to_file() {
    "${(j: :)${(qkv@)_pathOpts}}"
} 

With all extensions disabled, the resulting code looks like:

image

To get the parser to highlight correctly I have to append the following:

deserialize_pathOpts_from_file() {
    typeset -grA _pathOpts
    _pathOpts=( "${(Q@)${(z@)"$(<pathOpts_path)"}" )  #""
}

serialize_pathOpts_to_file() {
    "${(j: :)${(qkv@)_pathOpts}}"  #"
} 

image

Interestingly, If I put on a single quote on the deserialize _pathOpts assignment, I get a syntax error indication. Which is not true unless I don't know something about shell grammar. It's just clear that this is not an officially pattern or I should be doing something different somewhere.

image

jeff-hykin commented 6 months ago

Do you know what the name of this syntax is (Q@)? Like what zsh feature its related to

jeff-hykin commented 6 months ago

(note: this comment is unrelated to why the highlighting is failing) This part, in particular the <, is surprising to me that it parses at all "$(<pathOpts_path)" I'll have to check the spec on that

wdeshazer commented 6 months ago

Yes, this is called a parameter flag or parameter subscript. It is available in both bash and zsh. Not sure about sh. 15.2.3 Subscript Flags The particular flag Q you are asking about decreases the quote level by 1. q increases it. it is possible to increase it up to three times (qqq). printf has a similar capability, but a different syntax printf "%q"

I have been looking at it this morning and I see that the variable regex's are don't provsion for them. I think there is overlap with this and issue #74. I don't have a background in Textmate, but I have extensive experience with Perl Regex. I will look at the Textmate side and offer a proposal.

wdeshazer commented 6 months ago

This part, in particular the <, is surprising to me that it parses at all "$(<pathOpts_path)" I'll have to check the spec on that

This is a downstream consequence of missing the variable identification. The quote flag is still open. I have encountered a variety of confusing behavior that I fiddled with iteratively to see if I could reverse engineer the cause. In this case, I think the other is a symptom of the variable identification.

jeff-hykin commented 6 months ago

In terms of fixing the problem. Theres a pattern, related to variable assignment, for detecting an array assignment. I think the array pattern handles named and not-named arrays. For the named arrays, the string pretty much has to be matched with a one-line pattern (not pattern range) due to parser limitations. One line patterns can't handle nested stuff, like nested strings inside of string interpolation. Instead a pattern range has to be used with a start quote and end quote.

Slight hiccup tho, Textmate prioritizes long matches. So matching one-line string (one big chunk) compared to matching just the starting quote, will cause the whole-chunk to "win".

That "win" scenario is just a warning, idk if thats even happening here. It might simply be that the pattern-range version of the string pattern isnt even included in the array-literal range.

wdeshazer commented 6 months ago

Maybe you already realize this, but the best way to read this is as nested parameters. First, you have: "$(<pathOpts_path)" with returns a string that will become an associative array which I will call pathOpts_str since I know what I call it.

It then expands to become ${(z)pathOpts_str} which the z is a flag that will iterate parse the contents of consistent with the parsing algorithm of the zsh command-line. Thus it becomes ${pathopts_quoted[@]}", which then gets passed through Q@. The @ I think is just syntactically preferred by the community, but not necessary. I haven't figured out under what conditions it is. Either way, each element is dequoted iteratively.

Two good references for this are: [Zsh Native Scripting Handbook]{https://wiki.zshell.dev/community/zsh_handbook} and Zsh Cheat Sheet

wdeshazer commented 6 months ago

"One line patterns can't handle nested stuff, like nested strings inside of string interpolation. Instead a pattern range has to be used with a start quote and end quote."

I understand. Greedy vs non-greedy matching. This can actually be overcome by clever application of look-ahead/look-behind assertions I have a lot of experience in this and have been itching to build a tokenizer, so I am looking forward to helping you out with this one. Let's see if we can crack it. I'm almost sure that Regex can hack-it -although I know (some people who know-lol) for a fact that they are not Turing complete, meaning they can't reconstruct all logically valid syntaxes. I don't know when we would hit that limitation, but I believe these constructs are well within it's wheel-house.

By the way, what limits the parser to Textmate? Is that a VSCode thing?

jeff-hykin commented 6 months ago

This part, in particular the <, is surprising to me that it parses at all "$(<pathOpts_path)" I'll have to check the spec on that

This is a downstream consequence of missing the variable identification.

I think Theres a bit of miscommunication I'll try and clear up. I don't think the parameter expansion or parameter flag or (< is related to #88. I was just curious about the parameter flag/subscript (thanks for the info on that!)

For the "$(<pathOpts_path)" I'm assuming that this would be valid on its own, echo "$(<pathOpts_path)".

wdeshazer commented 6 months ago

image

jeff-hykin commented 6 months ago

Greedy vs non-greedy matching. This can actually be overcome by clever application of look-ahead/look-behind assertions

Sort of. Yes textmate is greedy, but its separate from the regex greedyness. Like (a|ab) in regex would match just a first and be happy AFAIK (wouldnt try "ab" unless something else in the pattern failed). But textmate its more like [ Pattern(/a/), Pattern(/ab/) ] would match "ab" despite the fact that "a" matched first.

I say "kinda" because yeah, we still use a lot of lookarounds to solve it on the textmate side.

Also seeing as you're a regex expert, that will help a lot. In terms of regex hacking, there is one warning. In theory, with enough recursive regex, matching a nested string with a one-line textmate pattern is possible. Sadly I spent a lot of effort on that once for a different language only who realize that textmate will only tag the last (most-inner) part of a recursive regex pattern. So even if the pattern is matched correctly, its not tagged correctly. Ive got an issue on VS Code textmate about it, but I think you me and @ redcmd are the only people in the world who might care about the issue. That said, even with broken scopes/tags sometimes the fact that the pattern matches correctly is enough. It would be enough in this case, but the massive amount of effort it would take to make a recursive nested string pattern (a pattern that needs to contain the entire grammar thanks to subshell interpolation) would be insane for just fixing this one bug and like 2 other non-cascading bugs.

what limits the parser to Textmate? Is that a VSCode thing?

Yep. Other editors are limited to Textmate too.

The alternative is the awesome Tree Sitter parser, which would never even run into this problem in the first place. Atom used it, NeoVim uses it. I use it for parsing tasks.

Fun fact though. Bash is one of the few languages (I think Perl is another) that is impossible to statically parse perfectly. There can be runtime changes to the syntax thanks to, at minimum, aliases. So even the tree sitter can't always parse bash. Gotta run it to parse it (sometimes)

jeff-hykin commented 6 months ago

image

Yep, cool. Wouldnt be surprised if that becomes #90 on this grammar

wdeshazer commented 6 months ago

It took me forever to find this reference. I expected it to be in the redirection section of the manual, but wouldn't you know it was in the Bash documentation on Command Substitution. What's funny about that is that I searched the document for "(<". If that didn't give it away. I will start with what I know, which is the regex side and then maybe we can work together to map it to Textmate if it is possible. I gotta run right now, but I'll be back.

wdeshazer commented 6 months ago

Actually, before I run, what is the order of operation when it comes to pattern recognition. Does the engine do sweeps based on the nodes within repository (I am referring to the json in autogenerated/shell.tmLanguage.json), does it do some compound search pattern or some other algorithm altogether? I want to make sure that we have regex patterns that don't conflict. Now I will talk to you later.

wdeshazer commented 6 months ago

Ok. I have some really good solutions that need to be rigorously tested. We also should discuss, what is and is not achievable with Textmate. I did find this good Textmate reference, which had a link to this one which suggests to me the Textmate grammar should be as rich as Perl's, but who knows. We should come up with some more rigorous patterns but they worked with all of my scripts, which are pretty aggressive:

\b(\w+)(?:=)  # Variables anything with assignments

(?:")([^"/]+)(?:") # Any quoted non-path

(?:\$\{)(\w+)(?:\}) # Any simply $ marked variable

(?:\$\{)((?:#)|\w)+(?:\}) # Any simply marked variable including with counting

(?:\$\{)([^ ]+)(?:\})  # Full Variable pattern Excluded everything but the space

(?:\$\{)(?:\()(@|\w+)(?:\))(\w+)(?:\}) # Flagged Variable

# Stack overflow to the Rescue [Regular Expressions to Match Balanced Parenthesis](https://stackoverflow.com/posts/35271017/revisions)
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\) Nested Parenthesis *Wow!!!!

Shell_variable_identification_regression.zsh

Shell Variable Identification Regression (Note this code isn't working fully. I was developing when I got sidetracked ```zsh #!/bin/zsh # _pathOpts startup is in $BIN for me $SOME_ROOT/bin pathOpts_path() { local fname="$BIN/._pathOpts.sh" [[ -e $fname ]] && echo ${fname} } < $(pathOpts_path) &2> /dev/null } # Manage _pathOpts based on flags edit_PathOpts() { local flag key path_value flag="$1"; key="$2"; path_value="$3"; while [[ $# -gt 0 ]]; do case $flag in "--add") _pathOpts[$key]=$path_value; shift 3 echo "Added path option: $key -> $path_value";; "--remove") unset _pathOpts[$key]; shift 2 echo "Removed path option: $key";; "--reset") mv "$_pathOptsPath" "$_pathOptsPath_$(gdate +%Y%m%d_%H%M%S)" unset _pathOpts initialize_PathOpts; shift 1 echo "Reset _pathOpts to default";; *) echo "Unknown flag: $flag" return 1 ;; esac done save } # Main function main() { initialize_PathOpts local debug=false [[ $1 == "--debug" ]] && shift && debug=true [[ $1 == "--_pathOpts" ]] && shift && edit_PathOpts "${(P)1[@]}" && return [[ $debug == true ]] && set -o xtrace find_folder "$@" [[ $debug == true ]] && set +o xtrace } main "$@" ```

Assignment Identification:

image

Quote Identification non-Path

image

$-wrapped Variable identification

image

Full Variable pattern - Excludes only spaces

image

Flagged Variable

image

image

image

Mother of all regex's - Nested Parenthesis - Thank you stack overflow

image

RedCMD commented 3 months ago

Slight hiccup tho, Textmate prioritizes long matches. So matching one-line string (one big chunk) compared to matching just the starting quote, will cause the whole-chunk to "win".

I wouldn't say that @jeff-hykin the regex quantifiers are greedy by default but TextMate is a first in first serve sort of priority (a|ab) would match a within the text ab, because a is first

patterns array is similar. however all regex's are run one after another then the regex that matched the earliest in the document is chosen if there is a tie, the regex that was earliest in the patterns array is chosen (L:source.langid injections are higher)

so in both cases of ('|'.+') and

"patterns": [
    { "match": "'" },
    { "match": "'.+'" },
]

the single ' will be matched first

if you attempt to also capture the whitespace before it \\w+', then that regex will have higher priority because it is earlier

EDIT: I think you've confused TextMate with TreeSitter as TreeSitter prioritizes long matches

RedCMD commented 3 months ago

I'm not sure I understand the problem

=( "${(Q@)${(z@)" "}" )

here we have two opening brackets { and only one closing } surely this is not valid bash?

placing a 2nd } appears to fix it image

deserialize_pathOpts_from_file() {
    typeset -grA _pathOpts
    _pathOpts=( "${(Q@)${(z@)"$(<pathOpts_path)"}}" )
}

serialize_pathOpts_to_file() {
    "${(j: :)${(qkv@)_pathOpts}}"
} 
wdeshazer commented 1 week ago

@RedCMD Thanks you for considering this issue. While I humbly concede that you caught a typo, I think you have overlooked the point.

The parsing scheme is not robust toward valid code. I will use your "surely". Surely I do not have to a have a bug free program entirely to have a local behavior behave correctly.

First, consider this: Why should neglecting to close a bracket within a string change the string identification at all. It shouldn't. While this "$(GobledyGook}}***" isn't going to result in a parameter substitution, so long as there isn't an escape character biting you in the butt, it should be just fine as a string, however impotent it is as code.

Second, I contend that the poor syntax highlighting aided in my missing the syntax error because there was no distinction between contexts. (That said I am slightly embarrassed 😬). So I feel that it is more the tool failing me than me failing the tool.

Next: I believe your own illustration still shows the behavior. To me pathOpts_path is highlighted like a string.

I recognize that the pattern is a challenging one. But it is a good test because it is not that extraordinary, albeit hard to read. After the open paren the pattern is: "string"SHELLSCRIPT"string" to make one word. I am willing to be corrected, but it looks to me to be highlighted exactly the opposite. I have no doubt that you recognize this, but for posterity, the following is the first string.

"${(Q@)${(z@)"

Just to illustrate, below is a correctly bracketed showing that the color is opposite of what it should be:

image

In the following image I threw in a "closing" quote before the $(<pathOpts_path) but it's still not correct with the first string. It should look green like pathopts did when the parser treated it like a string.

image

I did a deep dive that I really benefited from earlier this year, regarding the history of Textmate and it's regex engine Oniguruma and while I am super impressed with Oniguruma I read about some early tradeoffs that were made for speed at the expense of accuracy in the case of Textmate. I would rather have my computer running syntax highlighting more accurately than freeing up some energy so it can do more telemetry.

I also looked at the Ruby and the Textmate under the hood and used the developer tools to inspect what the parser was interpreting things and decided that I couldn't start this activity at that time and unfortunately I am still at that point.

On the surface though it's pretty simple. Quotes and Brackets are not at the same precedence and that's that. I'm willing to listen, because I have been known to make mistakes ;-) but on this one, I think I' pretty spot on.

image

RedCMD commented 1 week ago

nested strings inside of string interpolation

@jeff-hykin I assume nested strings require double quotes to be backslashed? so something like this would work? puts the string inside a capture, preventing internal brackets from escaping while still allowing multiline strings, if the ending double quote in on another line

image

image (\\$?\")((?>[^\\\\\"]++|\\\\.)*+)(\")

jeff-hykin commented 1 week ago

@wdeshazer

I would rather have my computer running syntax highlighting more accurately

Yeah, me too. Complain to VS Code. I didn't pick textmate.

The parsing scheme is not robust toward valid code

Fun fact, bash-like shell scripts are actually not (fully) possible to statically parse, even with the tree sitter, even with a custom built-for-bash parser. Bash syntax is runtime defined.

For example:

if (( RANDOM % 2 == 0 )); then
  alias this_works='if [ "hello '
else
  alias this_works='cat << EOF
'
fi
alias EOF=":"

this_works bob" = "hello  bob" ]; then
     echo 'bash was a mistake'
if
EOF

# sometimes that ends up being 
  # if [ "hello  bob" = "hello  bob" ]; then
  #      echo 'bash was a mistake'
  # if
  # EOF # <- null command
# other times it ends up being:
  # cat << EOF
  # bob" = "hello  bob" ]; then
  #      echo 'bash was a mistake'
  # if
  # EOF # <- end of string being echoed

If aliased are ignored (or assumed to be normal) then its statically parsable AFAIK. Just a fun fact.

wdeshazer commented 1 week ago

@jeff-hykin That's a good one! I often wonder where the line between syntax "aware" is and a full on syntax engine lies. I personally would like a tool that did the expansion real-time but I want to be able to toggle it to pause until I finish typing. LOL. It drives me batty when an engine is aggressive and I am mid thought. I'm working on trying to work out the logic myself and I am battling this beast to stop correcting me while I am typing. It robs me of focus.

I seem to recall that no one has shown that shell script is Turing complete and won't result in conflicting conclusions when all available rules are applied in all their possible parametrics. I don't know if there is a way to mathematically reduce the problem to something that is provable. But it is no doubt a hard problem and I imagine people just prefer to work, tend to apply a simpler constraint on themselves so that their codes work and move along.

That said, many parsing engines will limit the number of allowable expansions to like 3 or 7 or what have you. Infinite might be mathematically allowed, but it's not practical on a human scale.

I take your challenge though. I will look over this and see how I would approach it. Gauntlet thrown, gauntlet accepted.

wdeshazer commented 1 week ago

@jeff-hykin and @RedCMD I am going to start a discussion in addition to keeping this issue open. My thought is that there is a good discussion to be had about what can and can't be done with syntax highlighting and the three of us, and I bet some others who sound like they ask similar questions might enjoy weighing in. I'll try to keep my discussion on this thread directly related to how do we solve the issue at hand.

Oops. I assumed there was a conversation thread, but I don't think it has been enabled. @jeff-hykin would you be open to enabling the discussions feature for this repo? There are interesting discussions to be had about what is achievable and what is beyond the scope of a syntax highlighter that maybe related to this issue but would mostly be indirectly related.

image