quickwit-oss / tantivy-py

Python bindings for Tantivy
MIT License

writer.delete_documents does not work #297

Open lvjg opened 3 weeks ago

lvjg commented 3 weeks ago

I add three documents with doc_id = test-1, test-2, test-3, then call writer.delete_documents(field_name="doc_id", field_value="test-1"), followed by writer.commit(), writer.wait_merging_threads(), and index.reload().

test-1 can still be found through search...

cjrh commented 3 weeks ago

If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/

Fudge commented 2 weeks ago

I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["1", "2", "3"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

Output is:

Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Fudge commented 2 weeks ago

Refreshing the searcher gives the expected behaviour:

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
searcher = index.searcher()  # <-- refresh the searcher

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

gives:

SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)

Maybe just user error, then.

Fudge commented 2 weeks ago

Both the original reporter and I were bitten by term queries not matching values that contain characters like - and _, so the documents weren't actually deleted.

If you change 1, 2, 3 to 1, 2_2, 3-3 in the example above, only the first document is deleted as expected. A phrase query still finds the others after the attempted delete.

This illustrates it:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting everything:")
for d in docs:
    print(writer.delete_documents("doc", d))

writer.commit()
index.reload()
searcher = index.searcher()

print("After delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

which gives:

Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
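The pattern in the output can be mimicked in plain Python. This is a simulation, not the tantivy API; I'm assuming the default tokenizer splits on non-alphanumeric characters and lowercases, so a value like "1-2" is indexed as the two terms "1" and "2", and delete_documents("doc", "1-2") looks for the exact term "1-2", which was never indexed:

```python
import re

def tokenize(text):
    # Rough stand-in for tantivy's default tokenizer: split on any
    # non-alphanumeric character, lowercase each piece.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

for value in ["11", "1-2", "1_3", "1.4"]:
    terms = tokenize(value)
    # The delete only hits documents containing `value` as an exact term.
    print(value, terms, "deletable:", value in terms)
```

Only "11" survives tokenization as a single term, which is why it is the only value in the example that gets deleted.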
Fudge commented 1 week ago

Just a quick follow-up: changing the tokenizer to raw gives the expected behavior

schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")

and everything is deleted:

SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
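The difference can be sketched the same way (again a simulation, not the tantivy API): a raw tokenizer emits the whole field value as a single token, so the term passed to delete_documents matches exactly what was indexed:

```python
import re

def default_tokenize(text):
    # Approximation of the default tokenizer: split and lowercase.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def raw_tokenize(text):
    # Approximation of the raw tokenizer: one token, the untouched input.
    return [text]

for value in ["1-2", "1_3", "1.4"]:
    print(value,
          "default:", value in default_tokenize(value),  # split into pieces
          "raw:", value in raw_tokenize(value))           # exact match
```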
cjrh commented 1 week ago

@Fudge Thanks for looking at this ❤️

The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.

I haven't looked into this yet but I've been following your investigation. I wonder whether we can:

Does this sound like it would fix the issue?

I've always used delete_documents in tantivy on int fields (like a doc_id field), and that always works. So yes, it sounds like the tokenizer is the issue here.

Fudge commented 1 week ago

I don't think applying the tokenizer to the value would give the expected behavior in this case either, as trying to delete version-1.1.0 would also delete version-1.10 and version-11.0, for example.

delete_documents() on tokenized text fields is not intuitive, and should come with a warning. :-)
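The over-deletion risk can be shown with a small pure-Python sketch. This is hypothetical, not the tantivy API; it assumes the default tokenizer splits on non-alphanumeric characters, and models a naive "tokenize the value, then delete every document containing any of those terms" strategy:

```python
import re

def tokenize(text):
    # Rough stand-in for the default tokenizer.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# A toy "index": doc id -> indexed terms for its "doc" field.
index = {
    "doc-a": tokenize("version-1.1.0"),
    "doc-b": tokenize("version-1.10"),
    "doc-c": tokenize("version-11.0"),
}

def delete_by_tokenized_value(index, value):
    # Naive strategy: drop every doc that shares any term with the value.
    terms = set(tokenize(value))
    return {doc: toks for doc, toks in index.items()
            if not terms & set(toks)}

remaining = delete_by_tokenized_value(index, "version-1.1.0")
print(remaining)  # {} -- all three docs share the term "version"
```

All three documents share the term "version", so deleting by tokenized value wipes them all, which is exactly why a tokenized text field is a poor key for deletion.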