idank / bashlex

Python parser for bash
GNU General Public License v3.0

comment parsing # #47

Closed sereysethy closed 2 years ago

sereysethy commented 5 years ago

I checked and there is a test case for comments, but in my application, when I tried to parse a comment (e.g. # foo), it failed:

File "lib/python3.6/site-packages/bashlex/parser.py", line 611, in parse
        ef.visit(parts[-1])
File "lib/python3.6/site-packages/bashlex/ast.py", line 35, in visit
        k = n.kind
builtins.AttributeError: 'NoneType' object has no attribute 'kind'

I used the latest version 0.14.

sereysethy commented 5 years ago

The problem I noticed is that parsing a line that contains only # foo bar fails, but parsing foo # bar passes. My question is: is this normal?

idank commented 5 years ago

No, this seems like a bug.

BlankCanvasStudio commented 2 years ago

The error associated with this bug is: ERROR: Error : NEWLINE . <WORD@16:31 'somenextlinecommand'>

This happens because the parser has no mechanism for skipping a blank line.

The tokenizer sees a '#' and then skips the rest of the line. If the line is entirely a comment, the line gets passed to the parser as a blank line (i.e. just a '\n' character). The parser then fails, as it has no way to interpret a line consisting only of newline characters followed by another line. This bug can be replicated by simply adding a blank line between any 2 commands you try to parse, as the two situations are indistinguishable from the parser's point of view.
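As a rough illustration, here is a minimal sketch of that behavior. This is a toy model based on the description above, not the actual bashlex tokenizer code; skip_comment is a hypothetical helper:

```python
def skip_comment(line: str) -> str:
    # Toy model of the described behavior (assumption): on '#', the tokenizer
    # skips the rest of the line, so only a newline reaches the parser.
    idx = line.find('#')
    return line if idx == -1 else line[:idx] + '\n'

print(repr(skip_comment('# foo bar\n')))  # the parser sees only '\n' -- a blank line
print(repr(skip_comment('echo hi\n')))    # a line with no comment is unchanged
```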

Further, this issue does not appear if the comment is inside a for loop, an if statement, or at the end of the file.

It's worth noting that the tokenizer works perfectly, so the error is isolated to the parser. An update to the grammar seems necessary, and a parser update definitely is.

I'm working on a fix for this currently.

idank commented 2 years ago

Hey, nice job digging into this. Since you seem to have an idea of what needs to be fixed, would you be able to compare that with how bash handles this? It might be something I missed in the original pass I made, or perhaps differences in lex/yacc and the python implementation.

BlankCanvasStudio commented 2 years ago

Yeah, I'm looking through the source code currently. I'll post an update if I find anything.

BlankCanvasStudio commented 2 years ago

Just making a note for the sake of accuracy and documentation. Putting a comment at the end of a line also breaks the code and seems to be a similar bug. The comment is treated like a \n after it's been removed, so the parser doesn't know how to deal with the resulting \n \n case, caused by the \n that replaces the comment plus the \n already at the end of the line.
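A minimal sketch of the doubled-newline case. Again, this is an assumption modeled on the description, not the real bashlex code, and model_comment_removal is a hypothetical name:

```python
def model_comment_removal(line: str) -> str:
    # Toy model (assumption): the comment text is replaced by a single '\n',
    # while the line's own trailing '\n' is kept, so the parser sees '\n\n'.
    idx = line.find('#')
    if idx == -1:
        return line
    return line[:idx] + '\n' + ('\n' if line.endswith('\n') else '')

print(repr(model_comment_removal('echo hi # a comment\n')))  # ends with '\n\n'
```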

BlankCanvasStudio commented 2 years ago

The errors for inline comments, single line comments, and blank lines are all caused by the same bug.

When the parser encounters a blank line while in the initial state 0, it transitions to state 3. While in state 3, the only option is $end:-2. This means the only place a blank line can appear is at the end of the file, or when the state is not 0 (i.e. inside a for loop). If any token other than EOF follows while in state 3, the parser pops an error because t = None on line 370 of yacc.py.

This same bug manifests itself in comments because of how this implementation deals with them. In this implementation, the comment is simply converted into a \n. When the only content on the line is a comment and the initial state is 0, the parser converts the comment to a \n. It then transitions to state 3 and, if it's not the end of the file, pops an error, as EOF is the only transition option available in state 3. In the event of an inline comment, the comment is converted to a \n. This means the line ends 1 character earlier than it should (the actual end-of-line \n still remains). Thus the state goes to 0 before the end of the line, the \n is read, and the same error is popped.
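To make the state-machine argument concrete, here's a toy table-driven driver. It's purely illustrative: the states and tokens are simplified assumptions, not yacc.py's actual tables.

```python
# Toy LR-style driver (assumption, not yacc.py). State 0 is the start state;
# a NEWLINE from state 0 moves to state 3, where end-of-input ('$end') is the
# only legal follower.
TRANSITIONS = {
    (0, 'WORD'): 1,        # a command word
    (1, 'NEWLINE'): 0,     # end of a command line: back to the start state
    (0, 'NEWLINE'): 3,     # a blank line seen from the start state...
    (0, '$end'): 'accept',
    (3, '$end'): 'accept', # ...can only be followed by end-of-input
}

def run(tokens):
    state = 0
    for tok in tokens:
        state = TRANSITIONS.get((state, tok))
        if state is None:
            return 'error: unexpected ' + tok
    return 'accept'

# A normal script parses fine:
print(run(['WORD', 'NEWLINE', '$end']))                     # accept
# A blank line between two commands hits the dead end in state 3:
print(run(['WORD', 'NEWLINE', 'NEWLINE', 'WORD', '$end']))  # error: unexpected WORD
```

This mirrors the reported failure mode: the second NEWLINE lands in a state whose only exit is end-of-input, so the next WORD triggers the error.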

This is backed up by the current code, where the last line of the file can be a comment, be blank, or end with a comment. Further, for loops and if statements have no issues with blank lines or comments, as the state when they are encountered is not 0.

This problem has a number of different solutions. Firstly, comments could be converted to "" instead of "\n"; this would add support ONLY for inline comments, without any major issues. Adding a state transition from 0 -> 0 when a newline is encountered would solve the blank line, single-line comment, and inline comment issues without any other intervention necessary. The only problem with adding this transition is that there are comments indicating blank lines are added to the file artificially and intentionally (one such instance is lines 250-2 of tokenizer.py). One could also make the tokenizer lump a group of \ns together as a single one instead of leaving them as multiple. This could be done by peeking at the next character when the current character is \n, dropping the \n if the next character is also a \n, and looping until the next character isn't a \n.
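The peek-and-collapse idea could be sketched like this (a hypothetical standalone helper, not actual tokenizer.py code):

```python
def collapse_newlines(src: str) -> str:
    # Sketch of the "lump consecutive \n's into one" fix described above:
    # whenever the current character is '\n', peek ahead and drop any
    # immediately following '\n' characters, so runs collapse to one.
    out = []
    i = 0
    while i < len(src):
        out.append(src[i])
        if src[i] == '\n':
            while i + 1 < len(src) and src[i + 1] == '\n':
                i += 1  # peek: next char is also '\n', skip it
        i += 1
    return ''.join(out)

print(repr(collapse_newlines('foo\n\n\nbar\n')))  # 'foo\nbar\n'
```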

Bash, if I'm not mistaken, doesn't drop comments this way; it has an actual token for them, which is ignored when the code executes. This means that any of the above implementations is just as valid, provided one doesn't want to replicate bash's comment system. I think implementing comments as tokens isn't a great idea for usability and maintainability with older versions of the code, but I don't own the repo. As such, I was wondering what your thoughts are on which implementation seems preferable. (I personally favor the tokenizer method, as it doesn't touch the grammar and fixes all 3 issues.)

idank commented 2 years ago

That's an excellent analysis, thanks. You're the expert on this now! :-)

One thing I couldn't understand is why you think adding a token for comments isn't a good idea. In terms of having a complete parser, that seems ideal.

BlankCanvasStudio commented 2 years ago

In terms of having a perfect match to the bash parser, implementing comments is ideal. But implementing a comment token doesn't actually solve the issue. The problem is the lack of a transition from state 0 -> (state 0 with lineno++) when reading a newline character. This issue manifests itself in comments since they add additional "\n"s, but it is not actually a bug in the comment implementation itself. If there were a comment token, the blank line issue would still exist. But if the blank line issue is solved, then both the comment bug and the blank line bug go away. So, as I see it, implementing comments is nice but isn't actually a fix for the problem. It's a solution to a different issue which solves only part of the current problem, almost by accident.

I think the tokenizer fix is ideal, as it would preserve any \n-injection methods currently employed (assuming I'm reading the comments right), it would leave the grammar untouched, and it effectively adds a 'skip blank lines' mechanism to the whole parser. One could change the state transitions, but I don't know how bash does it (I do know how I would fix it, but that's a separate matter) and I don't know how that would affect other parts of the grammar.

BlankCanvasStudio commented 2 years ago

This bug is also why blank characters at the end of a line fail, i.e. "echo this; " should parse but fails: the blank character at the end is skipped, keeping the parser in the initial state, then the newline is encountered and the program crashes.

idank commented 2 years ago

Ok, that seems reasonable. Do you have a fix in the works?