staktrace / query-parser

A search query parser
6 stars 2 forks source link

Handling of letters after a closing quote but without any whitespace is not resilient in the face of typos #3

Closed asutherland closed 2 years ago

asutherland commented 2 years ago

The current behavior for quoting is reflected in the following test case:

assert_eq!(
  parse("'I made a typ'o 'oh no!'").terms,
  &[
    Term::new(false, None, "I made a typ"),
    Term::new(false, None, "o"),
    Term::new(false, None, "oh no!")
  ]);

That is, 'foo'bar parses as 2 tokens, "foo" and "bar" because a closing quote immediately resets parsing state without requiring any white-space.

This differs from shell behavior where quoting does not terminate tokens.

 $ echo 'foo'bar | wc
      1       1       7
 $ echo 'foo' bar | wc
      1       2       8

This just came up in searchfox where a query calls-between:'FrameLoader::checkLoadComplet'e calls-between:'dispatchWindowEvent' took out the (buggy/misconfigured) server because the accidental transposition of the letter "e" outside its intended location resulted in the query engine deciding that it needed to search for all symbols starting with the letter "e" (and I had misconfigured things by leaving the default limit at 0 thinking future me would add it in at the query mapping layer).

This seems like the typo that will happen a lot. It seems like the parser should either mimic shell parsing behavior here in all cases, or having this be a case that is optionally an error but by default self-corrects.

staktrace commented 2 years ago

Hm, interesting. I hadn't really written this with shell parsing behaviour in mind. In particular even key:value'quoted' will parse differently in this library vs in a shell. This passes:

assert_eq!(parse(r#"key:value'quoted'"#).terms, &[Term::new(false, Some("key"), "value'quoted'")]);

but with shell parsing the single-quotes would basically get dropped and value would get concatenated with quoted. Shell parsing also wouldn't handle unbalanced quotes like this: value'quoted so I don't think trying to match shell parsing is the right driving principle here.

asutherland commented 2 years ago

Ah, yeah, it looks a lot like I'm making the case for shell parsing here!

Let me re-express this as:

staktrace commented 2 years ago

So in general I don't really want to return an error back to the caller - I feel like everything provided should have a well-defined parse result. On the face of it I agree that 'I made a typ'o should parse as a single term. When I went to implement this change I came up with other cases that are less clear. What about 'I made a typ'o'graphical error'? Should this also parse as a single term? If yes, then that's a fundamental change that is effectively bringing us to shell parsing. If not, why not, and how should it parse?

asutherland commented 2 years ago

I finally have learned to eat food so I'm not writing a response on hunger brain and adding confusion to things!

I agree everything should have a well-defined parse result. My evolved pitch is:

For my root example:

For your example:

Note that I propose this as a way of guiding the decision process here. In practice I think this would amount to specialized heuristics and not some kind of general purposes magic parser thing that thinks about edit distances. The specific heuristic would be:

If the approach sounds attractive but that it's a lot to try and prototype given that it seems like the result might end up a mess that's too ugly to land, I'm happy to be the one to make the mess and you can evaluate the specific solution as well as the meta of whether the approach needs to scale to other more complicated edge-cases. I don't know that one would want to try and handle all possible permutations of colons and quoting; I'm mainly just concerned about the quoting.

staktrace commented 2 years ago

I agree the approach sounds attractive, but I feel that the implementation would involve almost a complete rewrite? It feels like you want to track a lot more state than is reasonable to do in the way I structured the state machine. But that's ok - I wrote this library with the intent of using it in searchfox, and it has no value at the moment outside of that use case. If it needs to be rewritten, then so be it :)

That being said I'm also happy to transfer ownership of this repo to you or into the mozsearch org, if that makes sense. For now please go ahead and prototype away, and depending on how much of a change it ends up being we can figure out what to do with it.

asutherland commented 2 years ago

I have an attempt up at https://github.com/staktrace/query-parser/pull/4 that addresses the quoting use-cases. I simplified the analysis somewhat for a 2-character look-ahead since what I proposed above didn't actually seem to need to do more work than that.

asutherland commented 2 years ago

Thanks for landing #4! I've now put https://github.com/staktrace/query-parser/pull/5 up and during the process of trying to write up some examples for the README I realized that we didn't talk about contractions too much and how they can get caught up in the transposition fix (when badly quoted). As I note in the pull request, I think the transposition firing actually mitigates the damage the poor quoting might do to a naive query consumer, but this is largely hand-waving. The contraction problem is hard; ideally all users are already programmers and trained to use double-quotes as much as possible because of this issue.

asutherland commented 2 years ago

With both PR's landed, I think we're good from my perspective, so I'm going to mark this closed. Thanks so much for your thoughtful and prompt responses! I'm hoping to expose an MVP of the new searchfox "query" endpoint based on this library in the next week (without obsoleting the "search" endpoint at all) for diagramming via menus and maybe even some (opt-in) super sidebar magic.

staktrace commented 2 years ago

Nice, I'm looking forward to it :)