tomlj gets confused with last comma in arrays

vieiro commented 1 year ago

The following TOML document can't be parsed.

base64 = { version = "0.21", default-features = false, test = ["std", ], optional = true }

(Note the comma in the ["std" ,], taken from arrow-rs/parquet/Cargo.toml ).

bbossola commented 10 months ago

Trailing commas are fine. The spec says it:

Arrays can also be multiline. So in addition to ignoring whitespace, arrays also ignore newlines between the brackets. Terminating commas are ok before the closing bracket. Arrays can also be multiline. So in addition to ignoring whitespace, arrays also ignore newlines between the brackets. Terminating commas are ok before the closing bracket.

bbossola commented 10 months ago

Can be reproduced with this test when added to TomlTest.java:

  @Test
  void testArrayValueWithEndingComma() throws Exception {
    TomlParseResult result1 = Toml.parse("data={fruit=[\n'apple',\n'banana',]}");
    assertFalse(result1.hasErrors(), () -> joinErrors(result1));
  }

The system weirdly will answer:

org.opentest4j.AssertionFailedError: Unexpected '}', expected } or a comma (line 3, column 11) ==> expected: <false> but was: <true>

Note the " Unexpected '}', expected }" part. This happens when an array is a value of a keyvalue pair. Unfortunately I did not find a way to fix it yet, some help from lexx/yacc experts would be welcome here! @cleishm could you take a look please?

Of course without the trailing comma after 'banana' everything works fine.

elohhim commented 9 months ago

So I wanted to use tomlj and came across this issue. Upon investigation, it seems that the problem lies in the lexer modes. There is no separate Array mode; instead, it is a part of ValueMode. Each comma, even if trailing, pushes a new ValueMode onto the stack. Now, ] is supposed to pop out of the current ValueMode; however, on top of the stack, we have two ValueModes related to the array - one for the array itself and one for a value that should come after the comma. Only one is popped, leading to an error.

I've created https://github.com/tomlj/tomlj/pull/63 as a fix. However, I'm not entirely satisfied with how I had to handle whitespaces in the ArrayTrailingComma rule. Since I'm not an expert on yacc, I'm open to someone else taking a look and potentially enhancing the solution.

cleishm commented 9 months ago

@elohhim Good analysis - that's definitely the issue. The fix looks reasonable. What's your concern on the whitespace handling?

elohhim commented 9 months ago

@cleishm thanks for quick response :)

My concern was regarding ArrayTrailingComma being defined as a compound expression',' (NL|WSChar)*. To the best of my understanding from the code analysis, this is the only place where such a pattern needs to be used. I wondered if this could, in any way, affect the whitespace logic in the generated parser. Would these whitespaces be consumed along with the comma (or maybe I am missing some details of internal antlr workings)? This consideration is relevant only if there is actual logic regarding whitespaces. However, based on tests, everything appears to work fine, so it indeed shouldn't be a problem.

cleishm commented 9 months ago

@elohhim I see what you're getting at, and you are correct. Right now whitespace tokens (which do not include newlines) are emitted to the WHITESPACE channel. With this change, that (NL|WSChar)* sequence is getting ignored entirely.

I'll think on a) if that really is anything to worry about, which is probably isn't since we don't expose the WHITESPACE channel in the library API, and b) how one might improve it.

Thanks for the fix!

cleishm commented 9 months ago

@elohhim I thought some more about it (takes a while to get your head back into ANTLR!). I'm wrong - there is a problem with the fix in #63.

Your analysis was mostly correct. The issue is related to an extra ValueMode being left on the lexer mode stack. However, this would also show up in the case of an empty array [ ], which the fix doesn't correct for.

For context, the existing code in the lexer uses ValueMode to apply lexer rules for parsing values. Each lexer rule for a value type emits one or more tokens and then pops the mode stack. However, when tokenizing an array, the code attempts to keep the lexer in ValueMode so all values in the array can be handled, and does so by pushing another ValueMode to the stack at the start of parsing an array ('[') and for every comma. That way the mode will remain in ValueMode after each value type rule pops the stack. But... this breaks for trailing commas and empty arrays, as only the additional ValueMode is popped off the stack and not the initial one.

Trying to distinguish between the case of a regular comma and a trailing comma, or an array with values vs an empty array, and dealing with whitespace, etc, starts to feel too much like putting grammar rules into the lexer. So instead, I'm going to modify the approach so that every value rule checks if the lexer is tokenizing an array and keeps the lexer it in ValueMode until the array is done.

tomlj / tomlj

tomlj gets confused with last comma in arrays #60