quickwit-oss / tantivy-py

Python bindings for Tantivy
MIT License

Date Range query produces "ValueError: Syntax Error". #55

Closed Sidhant29 closed 1 year ago

Sidhant29 commented 2 years ago

Hi. While trying to filter using date ranges, I get a syntax error. I have gone through the QueryParser docs in tantivy to check whether I had a formatting issue. The following code demonstrates the problem; copy and paste the Python code to reproduce:

from datetime import datetime
import os
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_date_field("date_published", stored=True, indexed=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Creating our index (in current working directory)
index = tantivy.Index(schema, path=os.getcwd() + '/index')

# Adding all the data.
writer = index.writer()
date = datetime(2022, 8, 2)
writer.add_document(tantivy.Document(
    date_published = date,
    title="The Old Man and the Sea",
    body="He was an old man who fished alone in a skiff in \
    the Gulf Stream and he had gone eighty-four days \
    now without taking a fish."
))
writer.commit()

index.reload()
searcher = index.searcher()

query = index.parse_query('date_published:[2002-10-02T15:00:00Z TO 2023-10-02T18:00:00Z]', ['date_published'])
print(query)
result = searcher.search(query, count=True, limit=5)

This produces the following error: ValueError: Syntax Error

To confirm that the query string itself was not malformed, I recreated the code in Rust, and it worked fine.

use tantivy::schema::*;
use tantivy::collector::TopDocs;
use tantivy::doc;
use tantivy::Index;
use tantivy::query::QueryParser;
use tantivy::Score;
use tantivy::{DocAddress, DateTime};
use std::env::current_dir;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let num_options: NumericOptions = NumericOptions::default();
    let date = schema_builder.add_date_field("date_created", num_options | STORED | INDEXED);
    let schema = schema_builder.build();
    let mut index_path = current_dir().unwrap();
    index_path.push("index");
    let index = Index::create_in_dir(&index_path, schema.clone()).unwrap();

    let mut index_writer = index.writer(100_000_000).unwrap();

    let mut doc = doc!(
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone in a skiff in \
                the Gulf Stream and he had gone eighty-four days \
                now without taking a fish.",
    );

    doc.add_date(date, DateTime::from_unix_timestamp(1659423709));
    index_writer.add_document(doc).unwrap();

    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![title, body]);

    let query = query_parser.parse_query("date_created:[2002-10-02T15:00:00Z TO 2023-10-02T18:00:00Z]").unwrap();
    let top_docs: Vec<(Score, DocAddress)> =
    searcher.search(&query, &TopDocs::with_limit(10)).unwrap();

    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address).unwrap();
        println!("{}", schema.to_json(&retrieved_doc));
    }
}

This produces the following output: [image]

I'm not sure whether this is a bug or an error on my side. I would really appreciate some help here.

cjrh commented 2 years ago

Working against current tantivy-py main, I don't get the same error:

$ python issue55.py
Query(RangeQuery { field: Field(1), value_type: Date, left_bound: Included([128, 0, 0, 0, 61, 155, 9, 240]), right_bound: Included([128, 0, 0, 0, 101, 27, 5, 32]) })
SearchResult(hits: [(1, DocAddress { segment_ord: 0, doc: 0 }), (1, DocAddress { segment_ord: 1, doc: 0 }), (1, DocAddress { segment_ord: 2, doc: 0 })], count: 3)
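
As a sanity check on the parsed query, the two byte arrays in the RangeQuery output above can be decoded back into timestamps. This is a minimal sketch that assumes the bounds are big-endian 64-bit values holding Unix seconds with the sign bit flipped (the usual way to make signed integers sort as unsigned); that encoding is an inference from the printed bytes, not something stated in this thread, and `decode_bound` is a hypothetical helper name.

```python
from datetime import datetime, timezone

SIGN_BIT = 1 << 63  # assumed: the sign bit is flipped so that i64 values sort as u64


def decode_bound(raw_bytes):
    """Decode one 8-byte range bound into a UTC datetime.

    Assumes a big-endian u64 that encodes signed Unix seconds with the
    sign bit flipped. This is an assumption about the index encoding.
    """
    as_u64 = int.from_bytes(bytes(raw_bytes), "big")
    seconds = as_u64 ^ SIGN_BIT
    if seconds >= SIGN_BIT:  # reinterpret as a signed 64-bit value
        seconds -= 1 << 64
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


left = decode_bound([128, 0, 0, 0, 61, 155, 9, 240])
right = decode_bound([128, 0, 0, 0, 101, 27, 5, 32])
print(left.isoformat())   # 2002-10-02T15:00:00+00:00
print(right.isoformat())  # 2023-10-02T18:00:00+00:00
```

Under that assumption the decoded bounds match the two timestamps in the query string, which suggests the parser read the date range as intended.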

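I did modify the Python reproducer in two ways: I create the index directory with os.makedirs before opening the index, and I print the search result at the end.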

Here is my full issue55.py:

from datetime import datetime
import os
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_date_field("date_published", stored=True, indexed=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Creating our index (in current working directory)
index_path = os.getcwd() + '/index'
os.makedirs(index_path, exist_ok=True)
index = tantivy.Index(schema, path=index_path)

# Adding all the data.
writer = index.writer()
date = datetime(2022, 8, 2)
writer.add_document(tantivy.Document(
    date_published = date,
    title="The Old Man and the Sea",
    body="He was an old man who fished alone in a skiff in \
    the Gulf Stream and he had gone eighty-four days \
    now without taking a fish."
))
writer.commit()

index.reload()
searcher = index.searcher()

query = index.parse_query('date_published:[2002-10-02T15:00:00Z TO 2023-10-02T18:00:00Z]', ['date_published'])
print(query)
result = searcher.search(query, count=True, limit=5)
print(result)

cjrh commented 2 years ago

Perhaps all this means is that a new release of tantivy-py is needed.

saroh commented 2 years ago

> Perhaps all this means is that a new release of tantivy-py is needed.

Very nice bug report. I don't think I can solve anything more than has been done here, but just to add a bit of context:

You're right, the current tantivy* packages on PyPI are outdated; tantivy is at 0.13 (https://pypi.org/project/tantivy/#history). I'm unsure who has ownership of these packages on PyPI, but as you've found out, for the time being you have to build a 0.17 version from the git repo using pip install git+https://github.com/quickwit-oss/tantivy-py/.

It might be worth it to host your own pip repo if it's some dependency you use often in your builds.

Probably needs discussion! See #49
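
Since the failure depends on which build is installed, it can help to check the installed version before running the reproducer. The sketch below uses only the standard library; the distribution name `tantivy` comes from the PyPI link above, while the `(0, 17)` threshold is an assumption based on the versions mentioned in this thread (0.13 fails, a 0.17 build from git works), and both helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed threshold: the thread reports 0.13 failing and a 0.17 build working.
MIN_WORKING = (0, 17)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_at_least(version_text, minimum):
    """Compare a plain dotted version such as '0.20.1' against a tuple.

    Only handles purely numeric components; pre-release suffixes are not supported.
    """
    parts = tuple(int(p) for p in version_text.lstrip("v").split("."))
    return parts >= tuple(minimum)


found = installed_version("tantivy")
if found is None:
    print("tantivy is not installed")
elif is_at_least(found, MIN_WORKING):
    print(f"tantivy {found} should parse date range queries")
else:
    print(f"tantivy {found} predates the fix; consider installing from git")
```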

kapilt commented 2 years ago

It looks like there's a single maintainer on PyPI, @poljar.

cjrh commented 1 year ago

@Sidhant29 now that a new version of tantivy-py is out, does this status change?

Sidhant29 commented 1 year ago

> @Sidhant29 now that a new version of tantivy-py is out, does this status change?

Confirmed: installing through pip now gives the new version, v0.20.1, and it resolves this issue. Closing this issue.