omry / omegaconf

Flexible Python configuration system. The last one you will ever need.
BSD 3-Clause "New" or "Revised" License
1.98k stars 111 forks source link

Quoted strings are not always properly parsed by the grammar #617

Closed odelalleau closed 3 years ago

odelalleau commented 3 years ago

Describe the bug

There exist tricky edge cases where quoted strings are not parsed as expected. Here is an example:

OmegaConf.clear_resolvers()
OmegaConf.register_new_resolver("identity", lambda x: x)
cfg = OmegaConf.create({
    "a": r""" ${identity: "hello\\" }"} """
})
print(cfg.a)  # ' hello\\" } '

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })

The reason for this lies in the regex for quoted strings:

QUOTED_VALUE:
      '\'' ('\\\''|.)*? '\'' // Single quotes, can contain escaped single quote : /'
    | '"' ('\\"'|.)*? '"' ;  // Double quotes, can contain escaped double quote : /"

Because of the \\" in the parsed string, it believes that this quote is an escaped quote, and thus keeps parsing until the next quote is found. However, the intent here is for \\" to represent a backslash followed by a quote, i.e., the string should end with this backslash.

(if you wonder why the test case looks so weird, it is so that it can run error-free with and without a potential fix)

Expected behavior

See expected output.

Additional context

There is a link to #615: if we didn't care about quoted strings ending with a backslash, we wouldn't need to escape backslashes, which would solve both issues at the same time. But I think we do care :)

omry commented 3 years ago

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and }) The output in the comment is not hello\" } but hello\\" }. Which one is it?

Looking at this: r""" ${identity: "hello\\" }"} """

I really have no idea what you are going for here. You are trying to go for a hello\ or are you trying to escape the double quote in the string so that it will appear in the resulting output? are you trying to do both?

omry commented 3 years ago

On master, the output is ' hello\" } ':

In [57]: OmegaConf.clear_resolvers()
    ...: OmegaConf.register_new_resolver("identity", lambda x: x)
    ...: cfg = OmegaConf.create({
    ...:     "a": r""" ${identity: "hello\\" }"} """
    ...: })
    ...: print(f"|{cfg.a}|")
| hello\" } |
odelalleau commented 3 years ago

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and }) The output in the comment is not hello\" } but hello\\" }. Which one is it?

Looking at this: r""" ${identity: "hello\\" }"} """

I really have no idea what you are going for here. You are trying to go for a hello\ or are you trying to escape the double quote in the string so that it will appear in the resulting output? are you trying to do both?

So, the intention is for ${identity: "hello\\" } to be an interpolation block where we pass to identity the string containing the following characters: hello\

However, the current code instead considers that the interpolation block is ${identity: "hello\\" }"} and provides to identity the string containing the following characters: hello\" }

odelalleau commented 3 years ago

On master, the output is ' hello" } ':

No, your code shows there is a backslash too (which isn't the problem -- the problem is that there should be no space after the double quote)

omry commented 3 years ago

So, the intention is for ${identity: "hello\\" } to be an interpolation block where we pass to identity the string containing the following characters: hello\

Like the following failed attempts?

# not working:
OmegaConf.create({"a": r"${identity: hello\}"}).a
...
GrammarParseError: missing BRACE_CLOSE at '<EOF>'
    full_key: a
    object_type=dict

# not working:
OmegaConf.create({"a": r"${identity: hello\\}"}).a
'hello\\'

OmegaConf.create({"a": r"${identity: 'hello\'}"}).a
'hello\\'

OmegaConf.create({"a": r"${identity: 'hello\\'}"}).a
'hello\\'
omry commented 3 years ago

Can you show me how you get Python to print 'hello\' (hello with a tailing backslash).

odelalleau commented 3 years ago

Like the following failed attempts?

Your last 3 examples actually worked, but:

OmegaConf.create({"a": r"${identity: hello\\}"}).a

=> only works without a quoted string because hello only has standard characters allowed in unquoted strings, replace it with h^llo for instance and you can't do it without quotes anymore

OmegaConf.create({"a": r"${identity: 'hello\'}"}).a

=> works because the regex extracting the quoted string actually identifies 'hello\' and doesn't go further, since there is no single quote afterwards. This is why this problem hasn't surfaced before, you need a really specific setup to highlight it (like the one in my example)

OmegaConf.create({"a": r"${identity: 'hello\\'}"}).a

=> same thing, the quoted string is 'hello\\' and the double escape gets converted in a single one during the un-escape phase of quoted strings

Can you show me how you get Python to print 'hello\' (hello with a tailing backslash).

Easy: print('hello\\')

What happened is you haven't been using print() explicitly, and the Python interpreter prints the output of __repr__() by default. And:

print(repr(r"There is a single backslash \ in this string"))
'There is a single backslash \\ in this string'

=> always use print() when there are \ in the string, otherwise it's quite confusing :)

omry commented 3 years ago

This time I paid special attention to how GitHub is rendering this crap, hopefully it will be clearer than my past attempts:

Going back to the original example:

There exist tricky edge cases where quoted strings are not parsed as expected. Here is an example:

OmegaConf.clear_resolvers()
OmegaConf.register_new_resolver("identity", lambda x: x)
cfg = OmegaConf.create({
    "a": r""" ${identity: "hello\\" }"} """
})
print(cfg.a)  # ' hello\\" } '

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me:

Interpretation 1: \\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

Interpretation 2: \ followed by a \" => \ followed by " not closing the quoted string. We and up with a singleton \ after un-escaping which I think should trigger an error.

To get your desired output, this seems to work and makes sense to me:

In [45]: cfg = OmegaConf.create({"a": r'${identity:"hello\\\"} "}'}); print(f"'{cfg.a}'");cfg.a == r"hello\"} "
'hello\"} '
Out[45]: True
odelalleau commented 3 years ago

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me:

Interpretation 1: \\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

That's (almost) the correct interpretation: since there are two \\ before the ", the string should end. There should be no syntax error though, because users are free to use any character they want outside of interpolations. The following examples should all be valid:

OmegaConf.create({
    "a": "Hello ${world}!",  # nothing special
    "b": "Hello '${world}'!",  # kind of a quoted interpolation, though nothing special is done with the quotes
    "c": "Hello ${world}'!",  # quote mismatch but we don't care
    "d": "Hello ${world}}'!",  # quote and brace mismatch but we still don't care
})

To get your desired output, this seems to work and makes sense to me: (...)

Just to be clear, the purpose of this issue isn't that we can't get a specific output, but to show a situation where the current code isn't outputting what it should.

omry commented 3 years ago

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me: Interpretation 1: \\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

That's (almost) the correct interpretation: since there are two \\ before the ", the string should end. There should be no syntax error though, because users are free to use any character they want outside of interpolations.

But we are not outside of an interpolation. We are inside

${identity: ...}.

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

r""" ${identity: "hello\\", foo, bar} """
odelalleau commented 3 years ago

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

I agree, and that's actually the case in r""" ${identity: "hello\\" }"} """: the next thing after the string ends is closing the custom resolver with the }, and what comes after belongs to the top-level string and should just be copied "as is"

omry commented 3 years ago

Okay, I FINALLY understand what you did there. I thought that this was your string: "hello\\" }" and this was the outer interpolation: ${identity: "hello\\" }"}.

I will take another look at this later.

odelalleau commented 3 years ago

Okay, I FINALLY understand what you did there.

Sorry, I know it's a confusing example, but that's the whole point: it is confusing to the current parser.

I thought that this was your string: "hello\\" }" and this was the outer interpolation: ${identity: "hello\\" }"}.

So did the grammar :)

omry commented 3 years ago

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

I agree, and that's actually the case in r""" ${identity: "hello\\" }"} """: the next thing after the string ends is closing the custom resolver with the }, and what comes after belongs to the top-level string and should just be copied "as is"

r""" ${identity: "hello\\" }"} """

So your intention is that the identity is getting "hello\", and outside of the custom resolver you have }" (space, brace, double quote)?

Shouldn't the trailing double quote be an error then? It should be the beginning of a new quotes string that is never terminating.

At a high level, I agree that when we see "hello\\", we should interpret it as a double quoted hello\.

odelalleau commented 3 years ago
r""" ${identity: "hello\\" }"} """

So your intention is that the identity is getting "hello\", and outside of the custom resolver you have }" (space, brace, double quote)?

Correct.

Shouldn't the trailing double quote be an error then? It should be the beginning of a new quotes string that is never terminating.

No because what's outside of interpolations is free-form, you can use quotes however you want. But you can add an extra quote if you prefer -- it shouldn't change the underlying issue.

At a high level, I agree that when we see "hello\\", we should interpret it as a double quoted hello\.

Ok good, we're on the same page then :)