Open lvjg opened 3 weeks ago
If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/
I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:
import tantivy
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
docs = ["1", "2", "3"]
for d in docs:
writer.add_document(
tantivy.Document(
doc=d
)
)
writer.commit()
index.reload()
searcher = index.searcher()
print("Before delete:")
for d in docs:
query = tantivy.Query.term_query(schema, "doc", d)
top_docs = searcher.search(query)
print(top_docs)
print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
print("After delete:")
for d in docs:
query = tantivy.Query.term_query(schema, "doc", d)
top_docs = searcher.search(query)
print(top_docs)
Output is:
Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Refreshing the searcher gives the expected behaviour:
print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
**searcher = index.searcher()**
print("After delete:")
for d in docs:
query = tantivy.Query.term_query(schema, "doc", d)
top_docs = searcher.search(query)
print(top_docs)
gives:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
Maybe just user error, then.
Both I and the original reporter were bitten by term-queries not matching values with characters like -
and _
, so the documents weren't actually deleted.
If you change 1,2,3
to 1, 2_2, 3-3
in the example above, only the first document works as expected. Searching for the documents with a phrase query finds the others the attempted delete.
This illustrates it:
import tantivy
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
writer.add_document(
tantivy.Document(
doc=d
)
)
writer.commit()
index.reload()
searcher = index.searcher()
print("Before delete:")
for d in docs:
query = index.parse_query("doc:{}".format(d), ["doc"])
top_docs = searcher.search(query)
print(top_docs)
print("Deleting everything:")
for d in docs:
print(writer.delete_documents("doc", d))
writer.commit()
index.reload()
searcher = index.searcher()
print("After delete:")
for d in docs:
query = index.parse_query("doc:{}".format(d), ["doc"])
top_docs = searcher.search(query)
print(top_docs)
which gives:
Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Just a quick follow up that changing the tokenizer to raw
gives the expected behavior
schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")
and everything is deleted:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
@Fudge Thanks for looking at this ❤️
The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.
I haven't looked into this yet but I've been following your investigation. I wonder whether we can:
delete_documents()
, Does this sound like it would fix the issue?
I've been using delete_documents
in tantivy always on int fields (like a doc_id
field) and that always works. So yeah sounds like the tokenizer is the issue here.
I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete version-1.1.0
would also delete version-1.10
and version-11.0
for example.
delete_documents()
on tokenized text fields is not intuitive, and should come with a warning. :-)
add three document,doc_id = test-1, test-2, test-3; use writer.delete_documents(field_name="doc_id", field_value="test-1") writer.commit() writer.wait_merging_threads() index.reload()
test-1 can still be found through search...