Closed asutherland closed 2 years ago
Hm, interesting. I hadn't really written this with shell parsing behaviour in mind. In particular even key:value'quoted'
will parse differently in this library vs in a shell. This passes:
assert_eq!(parse(r#"key:value'quoted'"#).terms, &[Term::new(false, Some("key"), "value'quoted'")]);
but with shell parsing the single-quotes would basically get dropped and value
would get concatenated with quoted
. Shell parsing also wouldn't handle unbalanced quotes like this: value'quoted
so I don't think trying to match shell parsing is the right driving principle here.
Ah, yeah, it looks a lot like I'm making the case for shell parsing here!
Let me re-express this as:
'
is a very horizontally small target when using the mouse to re-position or select.'I made a typ'o
parse as a single term with exactly that value than as 2 tokens as it does now, and that would be consistent with your example assertion as well. This can then be detected as an error and potentially automatically corrected (by the caller), whereas the current parse result set doesn't provide enough information to the caller to take action.So in general I don't really want to return an error back to the caller - I feel like everything provided should have a well-defined parse result. On the face of it I agree that 'I made a typ'o
should parse as a single term. When I went to implement this change I came up with other cases that are less clear. What about 'I made a typ'o'graphical error'
? Should this also parse as a single term? If yes, then that's a fundamental change that is effectively bringing us to shell parsing. If not, why not, and how should it parse?
I finally have learned to eat food so I'm not writing a response on hunger brain and adding confusion to things!
I agree everything should have a well-defined parse result. My evolved pitch is:
Showing results for "mozilla firefox", search instead for "mzilla firefox"
.For my root example:
'I made a typ'o 'oh no!'
'I made a typo' 'oh no!'
'I made a typ' o 'oh no!'
. (Note that this technically has the same edit distance, but the real situation was actually involving an e
and the single-quote '
where the typo can be modeled as a sequence problem and that this suggestion would require the user to have typed 2 space characters. One rationale to continue working within edit distance would be to say that the o
should be quoted in the best practice.)For your example:
'I made a typ'o'graphical error'
"I made a typ'o'graphical error"
(edit distance of 2 though)'I made a typ\'o\'graphical error'
(edit distance of 2 though)'I made a typ' 'o' 'graphical error'
(I added extra quotes here to try and hide the fact that just adding spaces is an edit distance of 2 as well.)Note that I propose this as a way of guiding the decision process here. In practice I think this would amount to specialized heuristics and not some kind of general purposes magic parser thing that thinks about edit distances. The specific heuristic would be:
If the approach sounds attractive but that it's a lot to try and prototype given that it seems like the result might end up a mess that's too ugly to land, I'm happy to be the one to make the mess and you can evaluate the specific solution as well as the meta of whether the approach needs to scale to other more complicated edge-cases. I don't know that one would want to try and handle all possible permutations of colons and quoting; I'm mainly just concerned about the quoting.
I agree the approach sounds attractive, but I feel that the implementation would involve almost a complete rewrite? It feels like you want to track a lot more state than is reasonable to do in the way I structured the state machine. But that's ok - I wrote this library with the intent of using it in searchfox, and it has no value at the moment outside of that use case. If it needs to be rewritten, then so be it :)
That being said I'm also happy to transfer ownership of this repo to you or into the mozsearch org, if that makes sense. For now please go ahead and prototype away, and depending on how much of a change it ends up being we can figure out what to do with it.
I have an attempt up at https://github.com/staktrace/query-parser/pull/4 that addresses the quoting use-cases. I simplified the analysis somewhat for a 2-character look-ahead since what I proposed above didn't actually seem to need to do more work than that.
Thanks for landing #4! I've now put https://github.com/staktrace/query-parser/pull/5 up and during the process of trying to write up some examples for the README I realized that we didn't talk about contractions too much and how they can get caught up in the transposition fix (when badly quoted). As I note in the pull request, I think the transposition firing actually mitigates the damage the poor quoting might do to a naive query consumer, but this is largely hand-waving. The contraction problem is hard; ideally all users are already programmers and trained to use double-quotes as much as possible because of this issue.
With both PR's landed, I think we're good from my perspective, so I'm going to mark this closed. Thanks so much for your thoughtful and prompt responses! I'm hoping to expose an MVP of the new searchfox "query" endpoint based on this library in the next week (without obsoleting the "search" endpoint at all) for diagramming via menus and maybe even some (opt-in) super sidebar magic.
Nice, I'm looking forward to it :)
The current behavior for quoting is reflected in the following test case:
That is,
'foo'bar
parses as 2 tokens, "foo" and "bar" because a closing quote immediately resets parsing state without requiring any white-space.This differs from shell behavior where quoting does not terminate tokens.
This just came up in searchfox where a query
calls-between:'FrameLoader::checkLoadComplet'e calls-between:'dispatchWindowEvent'
took out the (buggy/misconfigured) server because the accidental transposition of the letter "e" outside its intended location resulted in the query engine deciding that it needed to search for all symbols starting with the letter "e" (and I had misconfigured things by leaving the default limit at 0 thinking future me would add it in at the query mapping layer).This seems like the typo that will happen a lot. It seems like the parser should either mimic shell parsing behavior here in all cases, or having this be a case that is optionally an error but by default self-corrects.