Handling of letters after a closing quote but without any whitespace is not resilient in the face of typos

asutherland commented 2 years ago

The current behavior for quoting is reflected in the following test case:

assert_eq!(
  parse("'I made a typ'o 'oh no!'").terms,
  &[
    Term::new(false, None, "I made a typ"),
    Term::new(false, None, "o"),
    Term::new(false, None, "oh no!")
  ]);

That is, 'foo'bar parses as 2 tokens, "foo" and "bar" because a closing quote immediately resets parsing state without requiring any white-space.

This differs from shell behavior where quoting does not terminate tokens.

 $ echo 'foo'bar | wc
      1       1       7
 $ echo 'foo' bar | wc
      1       2       8
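
To make the difference concrete, here is a minimal sketch of the two splitting rules. Both functions are hypothetical illustrations (the names `split_quote_terminates` and `split_whitespace_terminates` are made up here); neither is this library's state machine, and the sketch ignores `key:` prefixes, negation, double quotes and escaping.

```rust
/// Hypothetical sketch: a closing quote ends the term immediately,
/// which is the behavior the test case above reflects.
pub fn split_quote_terminates(input: &str) -> Vec<String> {
    let mut terms = Vec::new();
    let mut current = String::new();
    let mut in_quote = false;
    for c in input.chars() {
        match c {
            '\'' if in_quote => {
                // Closing quote: emit the term even if no whitespace follows.
                terms.push(std::mem::take(&mut current));
                in_quote = false;
            }
            // A quote only opens a quoted block at the start of a term.
            '\'' if current.is_empty() => in_quote = true,
            c if c.is_whitespace() && !in_quote => {
                if !current.is_empty() {
                    terms.push(std::mem::take(&mut current));
                }
            }
            c => current.push(c),
        }
    }
    if !current.is_empty() {
        terms.push(current);
    }
    terms
}

/// Hypothetical sketch: shell-like rule where only unquoted whitespace
/// ends a term and the quote characters themselves are dropped.
pub fn split_whitespace_terminates(input: &str) -> Vec<String> {
    let mut terms = Vec::new();
    let mut current = String::new();
    let mut in_quote = false;
    let mut started = false; // lets an empty quoted string count as a term
    for c in input.chars() {
        match c {
            '\'' => {
                in_quote = !in_quote;
                started = true;
            }
            c if c.is_whitespace() && !in_quote => {
                if started {
                    terms.push(std::mem::take(&mut current));
                    started = false;
                }
            }
            c => {
                current.push(c);
                started = true;
            }
        }
    }
    if started {
        terms.push(current);
    }
    terms
}
```

Under these assumptions, the first function splits `'I made a typ'o 'oh no!'` into three terms, while the second yields `I made a typo` and `oh no!`.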

This just came up in searchfox, where the query calls-between:'FrameLoader::checkLoadComplet'e calls-between:'dispatchWindowEvent' took out the (buggy/misconfigured) server. The accidental transposition of the letter "e" to outside the quotes resulted in the query engine deciding that it needed to search for all symbols starting with the letter "e" (and I had misconfigured things by leaving the default limit at 0, thinking future me would add it in at the query mapping layer).

This seems like a typo that will happen a lot. It seems like the parser should either mimic shell parsing behavior here in all cases, or treat this as a case that is optionally an error but by default self-corrects.

staktrace commented 2 years ago

Hm, interesting. I hadn't really written this with shell parsing behaviour in mind. In particular even key:value'quoted' will parse differently in this library vs in a shell. This passes:

assert_eq!(parse(r#"key:value'quoted'"#).terms, &[Term::new(false, Some("key"), "value'quoted'")]);

but with shell parsing the single-quotes would basically get dropped and value would get concatenated with quoted. Shell parsing also wouldn't handle unbalanced quotes like this: value'quoted so I don't think trying to match shell parsing is the right driving principle here.

asutherland commented 2 years ago

Ah, yeah, it looks a lot like I'm making the case for shell parsing here!

Let me re-express this as:

- I think transposition typos like in the test case are very likely. The typo that motivated this issue was someone else's, but I personally make a lot of emergent transposition-related typos. Sometimes it's due to physical typing order, sometimes it's a mousing error because ' is a very horizontally small target when using the mouse to re-position or select.
- Shell quoting is actually super-weird, but I think shells do get it right that the white-space is the significant delimiter. (Noting that of course quoting inherently wants to capture white-space, which is why we need a parser in the first place! :)
  - That said, shell quoting makes some level of sense because escaping text is a nightmare and shell cases frequently involve wanting to interpolate variables and also not have to backslash everything.
- I think the example I provide is one where the priority of white-space means that it makes more sense to have 'I made a typ'o parse as a single term with exactly that value than as 2 tokens as it does now, and that would be consistent with your example assertion as well. This can then be detected as an error and potentially automatically corrected (by the caller), whereas the current parse result set doesn't provide enough information to the caller to take action.

staktrace commented 2 years ago

So in general I don't really want to return an error back to the caller - I feel like everything provided should have a well-defined parse result. On the face of it I agree that 'I made a typ'o should parse as a single term. When I went to implement this change I came up with other cases that are less clear. What about 'I made a typ'o'graphical error'? Should this also parse as a single term? If yes, then that's a fundamental change that is effectively bringing us to shell parsing. If not, why not, and how should it parse?

asutherland commented 2 years ago

I finally have learned to eat food so I'm not writing a response on hunger brain and adding confusion to things!

I agree everything should have a well-defined parse result. My evolved pitch is:

- Given that these queries are intended to be written by humans, typos will occur.
- It could be advantageous for the parse syntax to have enough redundancy that the edit distance between any two best-practice representations of queries that differ in their quoting is always greater than 1.
- The parser could optionally:
  - recognize parses that are adjacent (edit distance of 1) to best practice and flag the impacted terms as suspect
  - suggest an alternate parse and provide the alternate corrected string which would map to the alternate parse
- This would enable common UX idioms of:
  - When I search for "mzilla firefox" in Google (no o), I get Showing results for "mozilla firefox", search instead for "mzilla firefox".
  - Not auto-correct but putting little yellow or red squiggles under things.
- The parser would still always be returning a result! It would just optionally be saying: "But maybe here's a better result and you can offer your user an option between the choices".
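
As a rough sketch of what "always a result, optionally a better one" could look like to a caller: every type and function name below (`CheckedTerm`, `Suggestion`, `ParseOutcome`, `did_you_mean`) is hypothetical and invented for illustration, not an existing API of this library.

```rust
/// Hypothetical: a parsed term plus a "this looks like a typo" flag,
/// which a UI could render as a squiggle.
#[derive(Debug, Clone, PartialEq)]
pub struct CheckedTerm {
    pub value: String,
    pub suspect: bool,
}

/// Hypothetical: the corrected query string that maps to the alternate parse.
#[derive(Debug, Clone, PartialEq)]
pub struct Suggestion {
    pub corrected_query: String,
}

/// Hypothetical: the parse always succeeds; the suggestion is optional.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseOutcome {
    pub terms: Vec<CheckedTerm>,
    pub suggestion: Option<Suggestion>,
}

/// Builds the "showing results for ..." banner when a suggestion exists.
pub fn did_you_mean(original: &str, outcome: &ParseOutcome) -> Option<String> {
    outcome.suggestion.as_ref().map(|s| {
        format!(
            "Showing results for {}, search instead for {}",
            s.corrected_query, original
        )
    })
}
```

A caller that ignores `suggestion` and `suspect` would see the same terms as before, which keeps the "well-defined parse result" property intact.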

For my root example:

sketchy (example): 'I made a typ'o 'oh no!'
best practice suggestion: 'I made a typo' 'oh no!'
best practice not suggested: 'I made a typ' o 'oh no!'. (Note that this technically has the same edit distance, but the real situation actually involved an e and the single-quote ', where the typo can be modeled as a sequencing problem, and this suggestion would require the user to have typed 2 space characters. One rationale to continue working within edit distance would be to say that the o should be quoted in the best practice.)

For your example:

sketchy (example): 'I made a typ'o'graphical error'
best practice: "I made a typ'o'graphical error" (edit distance of 2 though)
best practice: 'I made a typ\'o\'graphical error' (edit distance of 2 though)
too far away best practice: 'I made a typ' 'o' 'graphical error' (I added extra quotes here to try and hide the fact that just adding spaces is an edit distance of 2 as well.)
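
The edit distances quoted above can be checked with a small sketch. This assumes "edit distance" means the optimal string alignment variant of Damerau-Levenshtein, where swapping two adjacent characters counts as a single edit; the thread does not name a specific metric, and `osa_distance` is a hypothetical helper, not part of this library.

```rust
/// Optimal string alignment distance: insertions, deletions, substitutions
/// and swaps of two adjacent characters each cost 1.
pub fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    // d[i][j] = distance between the first i chars of a and first j chars of b.
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        d[i][0] = i;
    }
    for j in 0..=m {
        d[0][j] = j;
    }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let mut best = (d[i - 1][j] + 1) // deletion
                .min(d[i][j - 1] + 1) // insertion
                .min(d[i - 1][j - 1] + cost); // match or substitution
            // Adjacent transposition, e.g. `'o` versus `o'`.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1);
            }
            d[i][j] = best;
        }
    }
    d[n][m]
}
```

With this metric, both the suggested and the not-suggested rewrite of the root example are 1 edit away from the typo, the double-quoted and backslash-escaped rewrites of the second example are each 2 away, and the fully re-quoted form is 4 away.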

Note that I propose this as a way of guiding the decision process here. In practice I think this would amount to specialized heuristics and not some kind of general-purpose magic parser thing that thinks about edit distances. The specific heuristic would be:

I saw a quote character that should end this quoted block. Is the next character whitespace?
- Yes: Yay! This is a best practice, nothing more to do.
- No: Doh, scan the text until we encounter whitespace and keep track of whether we see any more matching quotes, how many characters there were before the whitespace, as well as whether the last character before the whitespace is the same quote character we're looking at.
  - If there was a matching closing quote, then suggest that the outer-quotes be changed to the other quote type or that the inner quotes be escaped. (If we've seen the other quote type already, then we must suggest escaping, which could be the discriminator.)
  - If there was only one character, assume transposition, suggest swapping the characters.
  - If there were multiple characters then maybe escaping the quote is most sane? It's an edit distance of 1, but I think it would also be fine to indicate the character was suspicious but it's not clear how to correct it.
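
The decision tree above could be sketched roughly as follows. This is an illustration of the proposal only, not what any pull request implements: `QuoteAdvice` and `advise_after_closing_quote` are hypothetical names, end of input is assumed to count as whitespace, and "a matching closing quote" is assumed to mean that the same quote character appears anywhere in the run before the next whitespace.

```rust
/// Hypothetical outcome of inspecting the text after a would-be closing quote.
#[derive(Debug, PartialEq)]
pub enum QuoteAdvice {
    /// Whitespace (or, by assumption, end of input) follows the quote.
    BestPractice,
    /// Another matching quote follows: switch the outer quote type, or
    /// escape the inner quotes if the other type is already in use.
    RequoteOrEscape { must_escape: bool },
    /// Exactly one stray character follows: assume a transposition.
    SwapTransposed,
    /// Several characters follow: flag as suspect, no clear correction.
    SuspectUnclear,
}

/// `rest` is the input immediately after the quote that would close the
/// block, `quote` is that quote character, and `seen_other_quote` says
/// whether the other quote type has already appeared in the query.
pub fn advise_after_closing_quote(
    rest: &str,
    quote: char,
    seen_other_quote: bool,
) -> QuoteAdvice {
    // Scan up to, but not including, the next whitespace character.
    let run: Vec<char> = rest.chars().take_while(|c| !c.is_whitespace()).collect();
    if run.is_empty() {
        return QuoteAdvice::BestPractice;
    }
    if run.contains(&quote) {
        return QuoteAdvice::RequoteOrEscape { must_escape: seen_other_quote };
    }
    if run.len() == 1 {
        return QuoteAdvice::SwapTransposed;
    }
    QuoteAdvice::SuspectUnclear
}
```

For the root example, the text after the first closing quote is `o 'oh no!'`, so the run is the single character `o` and the sketch reports a likely transposition; for `o'graphical error'` it reports the requote-or-escape case.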

If the approach sounds attractive but seems like a lot to prototype, given that the result might end up a mess that's too ugly to land, I'm happy to be the one to make the mess. You can then evaluate the specific solution as well as the meta question of whether the approach needs to scale to other more complicated edge-cases. I don't know that one would want to try and handle all possible permutations of colons and quoting; I'm mainly just concerned about the quoting.

staktrace commented 2 years ago

I agree the approach sounds attractive, but I feel that the implementation would involve almost a complete rewrite? It feels like you want to track a lot more state than is reasonable to do in the way I structured the state machine. But that's ok - I wrote this library with the intent of using it in searchfox, and it has no value at the moment outside of that use case. If it needs to be rewritten, then so be it :)

That being said I'm also happy to transfer ownership of this repo to you or into the mozsearch org, if that makes sense. For now please go ahead and prototype away, and depending on how much of a change it ends up being we can figure out what to do with it.

asutherland commented 2 years ago

I have an attempt up at https://github.com/staktrace/query-parser/pull/4 that addresses the quoting use-cases. I simplified the analysis somewhat for a 2-character look-ahead since what I proposed above didn't actually seem to need to do more work than that.

asutherland commented 2 years ago

Thanks for landing #4! I've now put https://github.com/staktrace/query-parser/pull/5 up and during the process of trying to write up some examples for the README I realized that we didn't talk about contractions too much and how they can get caught up in the transposition fix (when badly quoted). As I note in the pull request, I think the transposition firing actually mitigates the damage the poor quoting might do to a naive query consumer, but this is largely hand-waving. The contraction problem is hard; ideally all users are already programmers and trained to use double-quotes as much as possible because of this issue.

asutherland commented 2 years ago

With both PRs landed, I think we're good from my perspective, so I'm going to mark this closed. Thanks so much for your thoughtful and prompt responses! I'm hoping to expose an MVP of the new searchfox "query" endpoint based on this library in the next week (without obsoleting the "search" endpoint at all) for diagramming via menus and maybe even some (opt-in) super sidebar magic.

staktrace commented 2 years ago

Nice, I'm looking forward to it :)

staktrace / query-parser

Handling of letters after a closing quote but without any whitespace is not resilient in the face of typos #3