quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

QueryParser does not support URL special characters in filter keys #1931

Open joepio opened 1 year ago

joepio commented 1 year ago

Describe the bug

I have a json field and want to filter through these objects. However, my JSON has HTTP URLs in its keys and values, like so:

{ "https://example.com/some_key":"https://example.com/some_value"}

Say we want to do a query filter that does the equivalent of some_key:some_value:

https://example.com/some_key:https://example.com/some_value

Parsing these URLs using QueryParser::parse_query creates issues, because the parser treats the : in the URL as a key-value separator.

Apparently you can escape special characters using \, but that didn't seem to work:

let q_escaped_colons = r#""https\:example:http\:somevalue""#;
let res = make_query_parser().parse_query(q_escaped_colons).unwrap();
// Parses "https" and "example" as phrase terms

So I think this feature is bugged, or perhaps I misunderstood it. Either way, it's not documented as far as I can see.

UPDATE: It does work if you escape the special characters in the key and wrap the value in double quotes (""):

let q_escaped_colons = r#"https\://example.com/test/bla:"https\://examplevalue.com/test""#;
let res = make_query_parser().parse_query(q_escaped_colons).unwrap();

Preferred solution

I suppose using quotes inside these filters (for keys and values) would be nice. It would be simpler for clients to implement than escaping a list of special characters.

e.g.: "https://example.com/some?complex&url":"http://example.com/1255"

If quoting keys and values with double quotes is supported, this test should not panic with a SyntaxError:

    #[test]
    pub fn test_parse_query_with_http_url() {
        make_query_parser()
            .parse_query(r#""https://example.com/some?complex&url":"http://example.com/1255""#)
            .unwrap();
    }

Which version of tantivy are you using? 0.19.1

And just because it's not said enough: thank you so much for maintaining this awesome library!

joepio commented 1 year ago

For reference, there's a discord chat about this issue.

If you escape the special characters in the key, and use double quotes for the value, you can get it working:

let q_escaped_colons = r#"https\://example.com/test/bla:"https\://examplevalue.com/test""#;
let res = make_query_parser().parse_query(q_escaped_colons).unwrap();

It leads to a very ugly query, but it works!

joepio commented 1 year ago

Update 2: I also needed to escape the dots in the key, as these are interpreted as JSON path separators.

So:

https\://example\.com:query

fulmicoton commented 1 year ago

Have you tried just quoting?

url:"http://www..."

They are usually used for phrase queries but they can actually be used here. It will take your url as a whole and send it to the tokenizer.

If you get an error regarding the lack of positions let me know.

joepio commented 1 year ago

@fulmicoton quotes work for the value, but not for the key. Keys require escaping with \, at least for the : and . characters.
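
For anyone landing here, the workaround from this thread can be wrapped in a small helper. This is only a sketch using plain string handling, not part of tantivy: the function names escape_key and build_filter are hypothetical, and it assumes that : and . are the only characters in the key that need a backslash, which is all this thread has established. Other special characters, and values that themselves contain double quotes, are not handled.

```rust
/// Escape the characters in a JSON key that the query parser would
/// otherwise interpret: `:` (field/value separator) and `.` (JSON path).
/// Assumption: no other characters need escaping (untested beyond this thread).
fn escape_key(key: &str) -> String {
    let mut out = String::with_capacity(key.len() + 4);
    for c in key.chars() {
        if c == ':' || c == '.' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Build a `key:"value"` filter string: the key is backslash-escaped,
/// the value is wrapped in double quotes so it reaches the tokenizer whole.
/// Assumption: `value` contains no double quotes.
fn build_filter(key: &str, value: &str) -> String {
    format!("{}:\"{}\"", escape_key(key), value)
}
```

The resulting string is what you would pass to parse_query. For the example at the top of this issue, build_filter("https://example.com/some_key", "https://example.com/some_value") yields https\://example\.com/some_key:"https://example.com/some_value".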