influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License

too many European quotes cause Flux to panic instead of error #3160

Closed rickspencer3 closed 4 years ago

rickspencer3 commented 4 years ago

A single set of European quotes causes a parse error (as expected), but some number more causes a panic.

nathanielc commented 4 years ago

The quotes trigger this Rust panic https://github.com/influxdata/flux/blob/master/libflux/src/core/scanner/mod.rs#L108

We need to both understand why the quotes were able to trigger the code path that panics and change the code path to report an error instead of panicking.

scbrickley commented 4 years ago

Just tracing a possible path of execution here. It looks like the scanner.offset() function that's panicking is mostly called inside of parser.source_location(), which itself is called in 11 different places, all in libflux/src/core/parser/mod.rs. The one that jumps out to me the most is this call in parser.parse_string_expression(), which runs if the parser finds an unexpected token.

The thing I'm not clear on is whether the parser would even know to call parse_string_expression() if it doesn't recognize the character as a proper "quote" character.

Also, what exactly is the purpose of the offset() method? I'm still pretty unfamiliar with this code base.

scbrickley commented 4 years ago

Someone else should confirm, but when I tried to recreate the issue in OSS on some sample data, I got different errors depending on where the European quotes were. If the last set of quotes before the end of the script is replaced with European quotes, like this:

from(bucket: "NOAA")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "h2o_feet")
  |> filter(fn: (r) => r["_field"] == "water_level")
  |> filter(fn: (r) => r["location"] == “coyote_creek”)

I get `unreachable executed` as an error.

But if any set of quotes other than the last one is replaced with European quotes...

from(bucket: "NOAA")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r[“_measurement”] == "h2o_feet")
  |> filter(fn: (r) => r["_field"] == "water_level")
  |> filter(fn: (r) => r["location"] == "coyote_creek")

I get `compilation failed: error at @3:24-3:60: expected RBRACK, got RPAREN` (or something similar).

Might be nothing, but it seems like the problem could be the positioning of the quotes rather than the number.
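The positional difference makes more sense once you look at how these quotes are encoded: the curly "European" quotes are three-byte UTF-8 sequences, while the ASCII quote the scanner expects is a single byte. A quick standalone Rust check (not Flux code) illustrates this:

```rust
// The curly quotes are multi-byte UTF-8 sequences; a byte-oriented scanner
// expecting the single byte 0x22 instead sees the first byte of a
// three-byte sequence.
fn main() {
    assert_eq!("\"".as_bytes(), &[0x22]);                   // ASCII quote, 1 byte
    assert_eq!("\u{201C}".as_bytes(), &[0xE2, 0x80, 0x9C]); // “ U+201C, 3 bytes
    assert_eq!("\u{201D}".as_bytes(), &[0xE2, 0x80, 0x9D]); // ” U+201D, 3 bytes
    println!("curly quotes are 3 bytes each");
}
```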

scbrickley commented 4 years ago

Update: I am now able to reproduce the panic in a unit test, and have found a different but related panic.

Whether this specific panic triggers seems to depend on whether the portion wrapped in European-style quotes is itself wrapped in parentheses/brackets/curly braces. If it's not, we get an entirely different panic, something to the tune of `slice index starts at 18, but ends at 17`.

For text wrapped in European quotes but not wrapped in parentheses, inserting any character immediately after the euro-quote (even a close-paren, as long as there's no matching open-paren) seems to suppress the panic.

For a string wrapped in european quotes that is then wrapped in parentheses, inserting any character after the close-paren seems to suppress the panic.

To better illustrate this, here's a list of strings I've tested, and the results I get when I pass them into the parser:

“some string” // panics `slice index starts at n but ends at n-1`
(“some string”) // panics `position should be in map`
[“some string”] // panics `position should be in map`
{“some string”} // panics `position should be in map`
(“some string”)\n // no panic
((“some string”)) // no panic
“some string”\n // no panic
“some string”_ // no panic
(“some string”)_ // no panic
“some string”) // no panic

It's worth noting that, even when the parser manages to get through the string without panicking, any character that comes immediately after a euro-quote is skipped (e.g., when the parser tries to grab the text inside the euro-quoted string “some string”, it only gets `ome string`).

Something about the way we are handling the invalid character is causing the parser to skip over characters that come immediately after it. As a result, the index for the start of a token "slides" forward. When the parser is looking for the start and end of a token with a length of 1 (e.g., a close-paren), it's easy for two "skips" to push the index for the start of a token past the index for the end of the token. When this occurs at the end of a buffer, offset() calls positions.get().expect() on a position that doesn't exist, and panics.
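To make the out-of-range condition above concrete, here is a minimal Rust sketch (the indices are made up, not taken from the actual scanner): once the start index slides past the end index, slicing by index panics with exactly this message, while `get()` surfaces the same condition without panicking.

```rust
// Hypothetical sketch of the failure mode: each skipped byte slides
// token_start forward, and for a one-byte token two skips can push the
// start index past the end index.
fn main() {
    let buf = b"some string\xE2\x80\x9D)"; // 15 bytes, ends with ”)
    let token_end = buf.len();      // scanner reached the end of the buffer
    let token_start = token_end + 1; // start index "slid" past the end

    // Indexing with buf[token_start..token_end] would panic with
    // "slice index starts at 16 but ends at 15"; get() instead returns
    // None, making the invalid range observable safely.
    assert!(buf.get(token_start..token_end).is_none());
    println!("invalid range detected");
}
```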

Similarly, when parsing a euro-quoted string not wrapped in parentheses, we are sent down a different path of execution that hits scanner._scan(), where we try to create a slice with a token_start index that exceeds the length of the buffer (inside of an unsafe block, no less). Hence the error slice index starts at n but ends at n-1

For the original panic, position should be in map, the backtrace looks like this:

A call stack that originates from parse_file() eventually calls parse_paren_body_expression(), which calls base_nodes_from_tokens(), which calls source_location(), which calls offset() and triggers the panic that @nathanielc linked above.

Still working on exactly why some characters are skipped over.

scbrickley commented 4 years ago

I think I've figured out what's going on. Here's an explanation of the issue, or at least my understanding of it.

The Core Problem

The problem originates in scanner.c, which is generated from this Ragel file. The generated scanner code amounts to one big loop (implemented with a bunch of goto statements). At the end of this loop, we increment a pointer (p) by one byte and check whether it's equal to another pointer (pe) that marks the end of the buffer. If it is, we exit the loop; if not, we go through it again. There appear to be three ways out of the loop:

  1. We hit the end of the buffer (i.e., p == pe)
  2. We scan a complete token
  3. We hit an unrecognized set of bytes (i.e., a character not defined in unicode.rl)

When we encounter that third scenario, we should exit the function as soon as possible and let the Rust code do the cleanup.
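A rough Rust sketch of that loop structure may help (the real code is generated C full of gotos; the names and the accepted byte set here are made up for illustration, with "recognized" reduced to plain ASCII and tokens ended by whitespace):

```rust
// Sketch of the scanner loop's three exits: end of buffer, a complete
// token, or an unrecognized byte (here, anything non-ASCII).
// Ok(None)          -> hit the end of the buffer (p == pe)
// Ok(Some((ts, p))) -> scanned a complete token spanning [ts, p)
// Err(p)            -> unrecognized byte at p (p may already be past ts)
fn scan(buf: &[u8], mut p: usize) -> Result<Option<(usize, usize)>, usize> {
    let pe = buf.len(); // end-of-buffer pointer
    let ts = p;         // token start
    loop {
        if p == pe {
            return Ok(None); // exit 1: end of buffer
        }
        let b = buf[p];
        if b == b' ' || b == b'\n' {
            return Ok(Some((ts, p))); // exit 2: token complete
        }
        if b >= 0x80 {
            return Err(p); // exit 3: unrecognized byte
        }
        p += 1; // otherwise, go around the loop again
    }
}

fn main() {
    assert_eq!(scan(b"ab ", 0), Ok(Some((0, 2))));      // a token
    assert_eq!(scan(b"ab", 0), Ok(None));               // end of buffer
    assert_eq!(scan("ab\u{201C}".as_bytes(), 0), Err(2)); // stops at “
}
```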

That pointer, p, tracks the scanner's progress through the string being scanned, and it's the same pointer we use to tell Rust where the start and end of every token is. The core of the problem lies in the fact that the scanner may need to process more than one byte to realize it has hit an unrecognized character (presumably because that character may share its first byte or two with other characters that are recognized), so it may take one or more loops before it knows it needs to exit and report an error. So, by the time we exit the function, p has already been incremented at least once.

After the scan function reports an error, the Rust code tries to clean up after it by finding the size of the unrecognized character, and adding it to p, trying to skip over it so it can scan the rest of the text. This is exactly what we want the Rust code to do, but this approach breaks down when the pointer has already been moved by the C scanner.

Then, in the next call to scan, a variable called token_start is set to <starting position of p> - <start of buffer>. We then use that as the starting index of a slice to grab the next token. As a result, the first byte that comes after the invalid character is skipped over. If this occurs in the middle of a buffer, it results in the scanner skipping any ASCII character immediately following the invalid character. If it occurs at the end of a buffer, we scan a byte in an invalid memory space, scan returns an error, and we again try to create a slice using token_start, but this time the index goes beyond the end of the array, and causes a panic. This underlying behavior is also what triggered the panic that @nathanielc linked above.
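The arithmetic behind that skip can be sketched in Rust (the offsets and variable names here are illustrative, not the actual scanner internals): after the C loop has already consumed one byte of the bad character, adding the full character width on top lands one byte too far.

```rust
// Hypothetical sketch of the skipped byte. After an unrecognized 3-byte
// character, the C loop has already advanced the cursor by one byte, and
// the Rust cleanup then adds the full character width on top of that,
// so the next token_start lands one byte past where it should.
fn main() {
    let bytes = "r[\u{201C}_measurement\u{201D}]".as_bytes();

    let err_pos = 2;                  // byte offset where the invalid “ begins
    let p_after_c_scan = err_pos + 1; // C loop already consumed one byte of it
    let char_width = 3;               // UTF-8 width of “ (e2 80 9c)

    // Buggy cleanup: skip the character's width from the moved pointer.
    let buggy_token_start = p_after_c_scan + char_width;
    // Correct cleanup: skip the width from where the character started.
    let fixed_token_start = err_pos + char_width;

    assert_eq!(bytes[fixed_token_start], b'_'); // '_' of _measurement survives
    assert_eq!(bytes[buggy_token_start], b'm'); // the '_' was skipped
}
```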

Proposed solution

When we encounter an unrecognized byte, we need to reset p to whatever it was when passed into the function. Luckily, this is a pretty straightforward fix, since the starting position of p is saved in another pointer (ts) near the start of the function. Unluckily, since the fix requires changing generated code, we can't implement it inside the flux repo, as it will be overwritten whenever CI runs `make libflux/scanner.c`. We will have to patch Ragel so that the generated code includes this change.

PR with the suggested fix + unit tests: #3187

nathanielc commented 4 years ago

@scbrickley Great explanation. The bug and fix make sense to me. Let me repeat to see if I got it.

The C scanner scans bytes; it has no concept of UTF-8 characters. If it finds an error (i.e., a sequence of bytes that is not allowed per the scanner rules), then it returns the position of the last byte that didn't match. This means that we could be pointing to the middle of a UTF-8 character. This naturally confuses the Rust code that tries to read and skip that UTF-8 character. The solution is to point to the first byte we read (as part of the call to scan) when we encounter an error, instead of the last.

If this is true, then any UTF-8 character that is more than one byte and not an allowed character (i.e., not in unicode.rl) should run into this bug. Can we test that to validate we understand correctly? Maybe with the registered-sign character ®? Its byte sequence is c2 ae.
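The byte sequence checks out, and by this theory any two-byte disallowed character should stop the C scanner one byte past where the character began:

```rust
// The registered sign is a two-byte UTF-8 sequence; a byte-level scanner
// that errors mid-character would leave its cursor on the second byte
// (0xAE), one byte past where the character began.
fn main() {
    assert_eq!("®".as_bytes(), &[0xC2, 0xAE]);
    println!("® is {} bytes", "®".len()); // 2
}
```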

If this is all correct, I think there is another solution we can apply. In the Rust code the problem is that self.p gets mutated by the call to the C scan. That is normally what we want, but as you have found, in the error condition the C scan does not leave self.p pointing at the beginning of a UTF-8 character. What if right here https://github.com/influxdata/flux/blob/master/libflux/src/core/scanner/mod.rs#L170 we reset self.p to the value of token_start? It's basically the same code you have in the PR, but on the Rust side and only as part of the error condition logic. Thoughts?
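A simplified sketch of that Rust-side idea (the struct, field, and helper names here are illustrative, not the actual scanner API in mod.rs): on an error token, rewind the cursor to where the failed scan started before skipping the bad character.

```rust
// Illustrative sketch of the proposed error-condition fix: reset the
// cursor to token_start before skipping the invalid character, so the
// skip is measured from the character's first byte.
struct Scanner<'a> {
    data: &'a [u8],
    p: usize, // cursor, normally advanced by the C scan
}

impl<'a> Scanner<'a> {
    fn handle_error(&mut self, token_start: usize) {
        // The C scan may have left `p` mid-character; rewind it to the
        // start of the failed scan first.
        self.p = token_start;
        // Skipping the full character width now lands on the next
        // character instead of one byte past it.
        self.p += utf8_width(self.data[self.p]);
    }
}

// Width of a UTF-8 sequence, derived from its leading byte.
fn utf8_width(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    }
}

fn main() {
    let data = "a\u{201C}b".as_bytes(); // a, e2 80 9c, b
    let mut s = Scanner { data, p: 2 }; // C scan stopped mid-character
    s.handle_error(1);                  // the failed scan began at byte 1
    assert_eq!(s.p, 4);                 // cursor now on 'b'; nothing skipped
    assert_eq!(data[s.p], b'b');
}
```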

scbrickley commented 4 years ago

@nathanielc You have it right, and your solution works. I guess I was so deep in the C code that it didn't occur to me to put the fix in the Rust code. I'll make that change and write a few more tests to confirm that other characters trigger this bug as well.