fix comment parsing - Githubissues

Philipp-Binder commented 10 months ago

Hi together,

I investigated how to replace the comment-detection via split with direct parser-functionality.

Doing this I think I replaced some bugs too. I've tried to document this in the first commit by adding more ParserTests showing the buggy behaviour and adding todos for recommended "fixed" tests.

Would be great to hear your opinion regarding the introduced tests.

rogusdev commented 10 months ago

I will take a look at this at some point soon. Thank you for investigating and presenting a potential solution, with more tests!

rogusdev commented 10 months ago

Heads up that in investigating this, I did some trivial clean up that will require rebasing: https://github.com/tonerdo/dotnet-env/pull/85

rogusdev commented 10 months ago

I had a bit of inspiration and put together: https://github.com/tonerdo/dotnet-env/pull/87 -- which does everything I was aiming for, to my perception: no regex splitting required to get comments on unquoted strings.

But it highlights that we now have one ambiguous case:

            testParse("EV_DNE", "", "EV_DNE=#no value just comment");

I have left this failing in this branch, because we need to talk about it. (I believe you touched on this with https://github.com/tonerdo/dotnet-env/pull/82#discussion_r1424412109 )

Consider 2 similar examples back to back (also in the PR)

            testParse("EV_DNE", "http://www.google.com/#anchor", "EV_DNE=http://www.google.com/#anchor");
            testParse("EV_DNE", "#anchor", "EV_DNE=#anchor");

Now, of these two, the first one should be unambiguous: we want a URL that happens to include a # char, but is just part of the unquoted string -- this is exactly a planned for use case.

The second of these two gets a bit fuzzier, but in the context of the first one, would seem reasonable that we want the ability to start a value with a # because sometimes that is a valid value.

However... obviously this means that if someone wanted to do:

            testParse("EV_DNE", "", "EV_DNE=#http://www.google.com/#anchor");

I.e. they wanted to comment out the entire value for EV_DNE in the .env file, but not the entire line itself then they could not do that, if we allowed values to start with a # (like the #anchor example above).

So: do we want values to be allowed to start with a #? And if you want to comment out a value, you just comment out the entire line, and put an extra line that sets ENVVAR= (i.e. empty) or ENVVAR= #comment which for sure works?

Or: do we want #s to only be valid in the middle of a string and not at the beginning?

And does this cover all the cases you were thinking of? I'm not quite sure yet.

rogusdev commented 10 months ago

OK, I think I talked myself into changing the behavior to make EV_DNE=#no value just comment truly just a comment, because in the current version of things, this fails:

            testParse("EV_DNE", "", "EV_DNE= #no value just comment");

Notice the space in front of the # -- I would definitely want this to be a comment, no question, and therefore this should pass, and that suggests to me that EV_DNE=#comment should also pass and with no value.

I have updated my PR accordingly

rogusdev commented 9 months ago

@Philipp-Binder before you go further on this PR -- tho I hope you are having a fun! :) -- I would like you to review my PR as an alternative proposal that I believe is a simpler and cleaner solution: #87 and give me any feedback on that, if you think there is anything missing there such that your approach is preferred, please.

What I mean is: do you think there is anything missing in my PR that you think your PR handles? I don't believe we need both.

Philipp-Binder commented 9 months ago

@rogusdev

What I mean is: do you think there is anything missing in my PR that you think your PR handles? I don't believe we need both.

I think the duscussion moves us in the right direction, but there are still some problems in both solutions. In the end we will need only one of them, yes, but at that point I believe we challenge us to a better solution.

With your approach I think we currently have a problem with trailing whitespaces in unquoted values if I'm right, I will comment it in your PR.

I have an idea about a slight different approach, I will explain it after updating my PR. And I will close some conversations above now, it gets a little confusing now. If you think some of them are still relevant you're welcome to reopen them.

Philipp-Binder commented 9 months ago

With your approach I think we currently have a problem with trailing whitespaces in unquoted values if I'm right, I will comment it in your PR.

Okay, I now understand why we do not have a problem with trailing whitespaces with your approach. Of course the tests show that this is not the case, but understanding the Unquoted-Parser is quite hard now. That's the only remaining problem then, and I will try to give a solution having that in mind.

One point where we currently differ: You say my solution tries to check for inline comment and yours don't, but by checking for a # after some whitespaces of course you do check for inline comments. You just do not explicitly name it, but otherwise you wouldn't be able to prevent the trailing whitespaces. Also you do check for linebreaks (aka controls) within Unquoted values, otherwise you wouldn't be able to stop the value before it. And this is kind of equal to quoted values, difference is onle that with quoted values is that we have the quotes as a defined start and endpoint instead of such constructs like whitespaces(optional) followed by #, ' or "

rogusdev commented 9 months ago

On this one point: "check for" is different than "parse". Parsing consumes the characters, whereas checking for prevents consuming the characters. By not consuming comments in the unquoted parser, the Assignment parser can consume them instead, in a more consistent fashion with the other Value parsers.

Philipp-Binder commented 9 months ago

Then we are on the same page, my approach using Except does not consume (aka Parse with your definition) comments too. That is still done by the Assignment parser.

InlineWhitespaceChars.Many().Then(w =>
                    NotControlNorWhitespaceChar("#'\"$").Once()

This piece of code of your approach parses InlineWhitespaces and if it parses any "disallowed" character afterwards the parser fails, resulting in throwing an exception or going on with an Or(...) or whatever is specified next, and parses frorm where the InlineWhitespace-Parser started, like a "rollback". The "not consume" is just because the parser fails, which is the same in my approach.

The real difference between the solutions is about the InlineWhitespace-Handling. and about the set of allowed characters inside an unquoted value. Not sure if it is a must have, but the following test will fail with your solution.

Assert.Equal("allow 'whitespace' before 'quotation' inline", Parsers.UnquotedValue.End().Parse("allow 'whitespace' before 'quotation' inline").Value);

But except for this I support your approach as long as I do not find a better way to do it ;)

rogusdev commented 9 months ago

Apparently there is a hot key to close a PR...? Not so helpful, github :(

rogusdev commented 9 months ago

Not sure if it is a must have, but the following test will fail with your solution.

I addressed this above on this PR here: https://github.com/tonerdo/dotnet-env/pull/82#discussion_r1425612902

And I added an explicit comment to explain the tests I added to address this concern in my PR: https://github.com/tonerdo/dotnet-env/pull/87/files#r1429207389

In short, I do not consider it a must have -- quite the opposite, it leads me to question some of the other things that are being allowed by these changes. All I wanted was for comments to not get consumed ;) I started expanding to these changes based on your interest, and because it actually made the code easier, but I consider it a slippery slope to allow quotes in unquoted values.

In truth, at this point, I am strongly leaning towards removing quote chars from unquoted vars because (in bash):

EVDNE=abc
echo $EVDNE
# abc
EVDNE=a b c
# b: command not found
EVDNE=a 'b' c
# b: command not found

Fish is a bit more forgiving:

set EVDNE abc
echo $EVDNE
# abc
set EVDNE a b c
echo $EVDNE
# a b c
set EVDNE a 'b' c
echo $EVDNE
# a b c
set EVDNE a'b'c
echo $EVDNE
# abc

But notice that even tho it does not complain about the b char like bash did for an unquoted value, it still does not include the quotes when that 'b' is in quotes.

So, at this point, I am strongly leaning towards removing quotes from unquoted values, like it was before. They either break in bash env vars (which is what I want .env files to match) or else do not do the same thing in fish.

Thoughts?

rogusdev commented 9 months ago

In fact, maybe I should just remove whitespace support from unquoted values entirely, given that example. We have quoted values to handle such things, and I want to be as close as possible to bash/posix/etc env var handling as possible.

rogusdev commented 9 months ago

I am strongly leaning towards removing quotes from unquoted values and In fact, maybe I should just remove whitespace support from unquoted values entirely

I did both of these on my PR: https://github.com/tonerdo/dotnet-env/pull/87/commits/c21430928e52f062a6e12109631668f31be8e2ff

But then I reconsidered and re-added inline whitespace, to minimize breaking changes: https://github.com/tonerdo/dotnet-env/pull/87/commits/2a4fec859d437549b2e08a31d6cbf1b590f62fe0

But tbh I'm still not sure how I feel about inline whitespace. It does occur to me now that allowing this will break source-ing these files, but idk how important that is, and I also don't know how often people are using inline whitespace to begin with. So...

Right now I'm leaning towards merging my current PR, which allows inline whitespace, but prevents quote chars, in unquoted values, release that as 2.6.0 and then maybe release 3.0.0 as no inline whitespace. Thoughts?

Or, eh, just leave it all as-is, minimize the changes -- no one is complaining about inline whitespace, and no one is asking for more support for source-ing, so... I guess it ain't broke, and thus needs no fixing.

Philipp-Binder commented 9 months ago

That's the ugly thing about the mising standard :D

My first intention was: yes, drop whitespace-support and take bash as de facto standard.

But my second thought was: What is the real point for me to have an env-File. It's mostly about having a super easy way to have env-vars at development-time without modifying the environment variables (and having problems with other projects and so on...) And then I'm back to: let's have the user-side as easy as possible. And as far as I'm concerned, I use unquoted values nearly exclusively, just using quoted if theres no other way. I just want to paste my value and use it, not more, but not less. And then I'm back at: let unquoted values have any value except for special leading chars (whitespace, quotes, hash) and no trailing quotes, that's it.

The thing is, should it be compatible with the capability of bash variable assignment? A little bit of research gave me that Gist: https://gist.github.com/judy2k/7656bfe3b322d669ef75364a46327836 The discussion points at this repo, which tries to extend bash env-capabilities with using some definitions from other tools (like npm, docker etc, which all use their own definitions... quite ugly without a standard...): https://github.com/ko1nksm/shdotenv

That's also possibly not the best repo for a standard we want to use too, for example it does not allow whitespaces around the equal-sign, which is a good thing to be supported in my opinion. And it disallows whitespaces in unquoted values too (didn't try, but I expect it to be that way reading the readme).

What I try to say: I don't think that bash variable assignment is the best standard to be used, having local development/user experience in mind.

rogusdev commented 9 months ago

Yeah, that's why I put whitespace back in. But I still think quotes in unquoted is a bad idea. Do you have an example case where you want to do that?

For me, I'm worried someone does this:

EVDNE=a'superimportant thing'

Where the leading a is a typo and would just be confusing for it to suddenly make the quotes somehow up in the value, rather than an error highlighting the typo. We have quoted values, we should encourage them, imo.

I am feeling very ok with my PR at this point. I am grateful for the discussion that has led here, we've covered good ground. But I want to 1) minimize sudden (breaking) changes and 2) keep the code clean and relatively straightforward to read. Fixing just the actual code cleanup goal, and supporting a couple of previously broken cases, but not changing behavior too much with allowing quote chars in unquoted suddenly, and certainly not removing whitespace support... I think this is a good change for a now venerable and otherwise very stable library.

Philipp-Binder commented 9 months ago

I don't think a typo like that is a real good reason, because we are not supposed to cover typos I guess, and you will recognize this typo quite fast when you check the resulting value.

My example-case would be a passwort which is something like D[}9P"LlBf!{Fg*%9S7]. You already have that annoying problem with Interpolation (which is a good thing, but really annoying with passwords).

But maybe the better strategy here is using singlequotes for passwords, because it additionally prevents unintended Interpolation. With some highlighting (for example https://plugins.jetbrains.com/plugin/9525--env-files-support) you also see problems due to inline singlequotes quite good. grafik

Saying that, I support you in disallowing quotes within unquoted values.

Unfortunately these .env-highlighting Plugins have their own definition due to the lack of standardization too... your example shows up like that: grafik The one with whitespaces complains only due to the inline whitespaces. There are more ugly things with that analyzer, but that's not the point here of course ;)

Then there is only one thing I want to figure out now, and maybe preferring my PR due to that (after changing quote-behavior of course): I would say it is a good thing to explicitly disallow certain leading-characters and add testcases for that, instead of relying on the behavior of the Assignment Parser. That is equivalent to DoubleQuoted and SingleQuoted which expect the " or ' as the first parsed character, whitespaces have to be handled before, otherwise the Values are not recognised. For UnquotedValues the difference is, that you do not have a special "initial" character but a set of explicitly disallowed ones. I will comment that at your PR, including a test checking the behaviour.

rogusdev commented 9 months ago

I look forward to your test case, because I am not sure what you mean about explicitly disallowing certain initial characters, other than what is already explicitly disallowed, and does have test cases for?

The rest all sounds good, and looks like the env extension agrees that my example is relevant, and a good reason to not allow quote chars.

Philipp-Binder commented 9 months ago

Comment is added (https://github.com/tonerdo/dotnet-env/pull/87#issuecomment-1860547670).

I also updated my PR and doublechecked whether I have removed or changed tests. ==> there are just additional tests now, that can be discussed of course, but accordingly the solution is only slightly more strict compared to before, nothing additionally added. As a bonus: all the examples are still working, so no removal of support for any given testcase.

Not sure if we removed anything now :D I forgot if quotes were somehow possible with the old solution, if so, that is the only breaking change, but that is intended as discussed.

So the main point of the PR is now the initial point: use Parser instead of Split/Trim and while at it, guard the UnquotedParser a little better.

Philipp-Binder commented 9 months ago

Red-Green Test-List:

It's more or less the same list as in the "twin"-PR (compare https://github.com/tonerdo/dotnet-env/pull/87#issuecomment-1870680310)

EnvTests
- ParseInterpolatedTest
- Assert.Equal(" surrounded by spaces ", Environment.GetEnvironmentVariable("TEST_UNQUOTED_WITH_INTERPOLATED_SURROUNDING_SPACES")); --> old version trimmed the final calculated value
- Assert.Equal("Parsing failure: Unexpected end of input reached; expected ' (Line 1, Column 20); recently consumed: QUOTE='abc", ex.Message); --> message changed (better user-experience)
- Assert.Equal("Parsing failure: Unexpected end of input reached; expected \" (Line 1, Column 20); recently consumed: QUOTE=\"abc", ex.Message); --> message changed (better user-experience)
ParserTests
- ParseUnquotedValue
- Assert.Throws<ParseException>(() => Parsers.UnquotedValue.Parse("'")); --> old version returned empty value
- Assert.Throws<ParseException>(() => Parsers.UnquotedValue.Parse("\"")); --> old version returned empty value
- Assert.Throws<ParseException>(() => Parsers.UnquotedValue.Parse(" ")); --> old version returned empty value
- Assert.Throws<ParseException>(() => Parsers.UnquotedValue.Parse("\t")); --> old version returned empty value
- Assert.Equal("", Parsers.UnquotedValue.Parse("#").Value); // no value, empty comment --> old version treated this as value
- Assert.Equal("", Parsers.UnquotedValue.Parse("#commentOnly").Value); --> old version treated this as value
- ParseValue
- Assert.Throws<ParseException>(() => Parsers.Value.Parse("'")); --> old version returned empty value
- Assert.Throws<ParseException>(() => Parsers.Value.Parse("\"")); --> old version returned empty value
- ParseAssignment
- testParse("EV_DNE", "", "EV_DNE=#no value just comment"); --> old version treated this as value
- testParse("EV_DNE", "", "EV_DNE= #no value just comment"); --> old version treated this as value
- testParse("EV_DNE", "test", "EV_DNE= test #a'bc allow singleQuotes in comment"); --> old version did not allow this
- testParse("EV_DNE", "test", "EV_DNE= test #a\"bc allow doubleQuotes in comment"); --> old version did not allow this

Philipp-Binder commented 9 months ago

I tried to catch all todos, hopefully I did not forget something, otherwise please mention it!

Last open Thread I guess is is UnquotedValue/UnquotedValueContent okay now --> https://github.com/tonerdo/dotnet-env/pull/82#discussion_r1437756522.

edit: did not see that you already answered ;) Then I guess I await a final review (given I did not forget something...) 👍

tonerdo / dotnet-env

fix comment parsing #82