hashicorp / hcl2

Former temporary home for experimental new version of HCL
https://github.com/hashicorp/hcl
Mozilla Public License 2.0
373 stars 66 forks source link

Scanning error for quoted templates ending with $ or % #102

Closed apparentlymart closed 5 years ago

apparentlymart commented 5 years ago

Given the following input:

is_percent = local.percent_sign == "%" ? true : false

The quoted string literal token after the open quote seems to consume the remainder of the line, and all following lines. In other words, it prevents the end pattern " from matching and thus prevents the scanner from returning to the normal config scanning mode.

This seems to be the result of some faulty assumptions about Ragel. The ability for us to treat isolated % and $ introducers (not followed by {) as quoted literal string parts is implemented by some special sequences:

            ('$' ^'{' %{ fhold; }) |
            ('%' ^'{' %{ fhold; }) |

The goal here is to look ahead to see if the byte after the introducer is {, and if not to back up one byte so we can emit the $ or % in isolation and then continue scanning at the subsequent {.

It seems though that somehow then the " matches as a string literal character even though that character is specifically excluded from the set of valid string literal characters unless preceded by a \, as we can see in the full rule:

        TemplateStringLiteral = (
            ('$' ^'{' %{ fhold; }) |
            ('%' ^'{' %{ fhold; }) |
            ('\\' StringLiteralChars) |
            (StringLiteralChars - ("$" | '%' | '"'))
        )+;

I have not yet determined how the TemplateStringLiteralRule is then proceeding to consume the ", but I do see it as part of the resulting TokenQuotedLit token along with all of the following bytes on the line.

We did already have some tests for % and $ appearing right before ", but all of those existing cases had the " at the end of the input, and so I suspect that the longest-match-wins rule is playing a part here somehow. I saw some better results by moving the + modifier at the end of the pattern to instead apply only to the final sub-pattern (StringLiteralChars - ("$" | '%' | '"')). That causes a more suitable result for this input, at the expense of causing us to generate multiple consecutive TokenQuotedLit when literal $, %, and character escapes are present. But the result ought to be functionally equivalent, so maybe that'll be a reasonable path forward to fix this up without significant rework of the grammar.

apparentlymart commented 5 years ago

Ahh, I just remembered why moving the + doesn't work: it causes an escape like $${ to not actually escape because the pattern ends and the following bytes are ${ and the state about it being an escape are lost. Further investigation required, then...