Annliteindexer results change every bootup within a jina flow

jina-ai / annlite

⚡ A fast embedded library for approximate nearest neighbor search

Apache License 2.0


Annliteindexer results change every bootup within a jina flow #223

Closed. Mewral closed this issue 1 year ago.

Mewral commented 1 year ago

Hi, I'm currently using the annlite indexer in a jina flow to do text embedding search. Every time I start the flow, the annlite indexer gives me different results for the same data. Any idea what the problem is?

jemmyshin commented 1 year ago

What do you mean by different results? Are you using the /query endpoint to find the most relevant item?

Mewral commented 1 year ago

Yes. I'm using /query to find the most relevant item, but I get different top-k results every time I reload my database with the same query.

JoanFM commented 1 year ago

I believe this could be reasonable given the non-deterministic nature of the underlying HNSW algorithm.

JoanFM commented 1 year ago

Can you provide a small example showing the specific problem?

Mewral commented 1 year ago

Yes. I think HNSW is the reason, but I haven't proved it yet. I'll post an example tomorrow. Thanks for the reply.

Mewral commented 1 year ago

So basically my annlite query endpoint returns the top-10 relevant document indices, e.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. After I stop the flow and reboot it, it returns [1, 2, 3, 4, 12, 6, 7, 8, 9, 10], and I didn't change any config. Perhaps it has something to do with my bootup: the log says I'm building the annlite index from scratch every time I run my code. I did see a snapshot_path option for loading a previous index state, but I don't know how to set it.
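
To make a change like this easier to report, the two runs can be compared mechanically. The sketch below is plain Python and does not touch annlite. `topk_diff` is a hypothetical helper name, and the two ID lists are the ones quoted in the comment above.

```python
def topk_diff(before, after):
    """Compare two top-k ID lists returned for the same query in separate runs."""
    before_set, after_set = set(before), set(after)
    return {
        # Fraction of the first run's IDs that also appear in the second run.
        "overlap": len(before_set & after_set) / len(before),
        "dropped": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


run_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # before the reboot
run_2 = [1, 2, 3, 4, 12, 6, 7, 8, 9, 10]  # after the reboot
report = topk_diff(run_1, run_2)
print(report)
```

Here a single ID differs (5 was replaced by 12), which is the kind of small drift an approximate index can show after a rebuild.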

jemmyshin commented 1 year ago

HNSW will be rebuilt every time if you don't dump the index, so the results might be slightly different. Could you try dumping the indexer, then loading it and searching?
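
The dump-then-load idea can be illustrated without annlite. The sketch below uses a toy random-hyperplane index as a stand-in. It is not HNSW and not the AnnLite API, and `ToyRandomIndex`, `dump`, and `load` are hypothetical names chosen for this example. It only shows the principle: an index built with unseeded randomness can differ from one build to the next, while an index restored from a dump answers exactly as before.

```python
import json
import random
import tempfile
from pathlib import Path


class ToyRandomIndex:
    """Toy random-hyperplane index. A stand-in for illustration, NOT AnnLite or HNSW."""

    def __init__(self, vectors, planes=None, n_planes=4):
        self.vectors = {int(i): list(v) for i, v in vectors.items()}
        dim = len(next(iter(self.vectors.values())))
        if planes is None:
            rng = random.Random()  # unseeded on purpose: each fresh build differs
            planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.planes = planes
        self.buckets = {}
        for doc_id, vec in self.vectors.items():
            self.buckets.setdefault(self._key(vec), []).append(doc_id)

    def _key(self, vec):
        # Which side of each random hyperplane the vector falls on.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes)

    def search(self, query, k=3):
        # Only the query's bucket is scanned, so true neighbours can be missed.
        def sq_dist(doc_id):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[doc_id], query))
        candidates = self.buckets.get(self._key(query), [])
        return sorted(candidates, key=lambda d: (sq_dist(d), d))[:k]

    def dump(self, path):
        # Persist the random structure together with the data.
        state = {"vectors": self.vectors, "planes": self.planes}
        Path(path).write_text(json.dumps(state))

    @classmethod
    def load(cls, path):
        state = json.loads(Path(path).read_text())
        return cls(state["vectors"], planes=state["planes"])


# Synthetic data, seeded so that only the index build is random.
data_rng = random.Random(0)
vectors = {i: [data_rng.uniform(-1, 1) for _ in range(8)] for i in range(200)}
query = [data_rng.uniform(-1, 1) for _ in range(8)]

index = ToyRandomIndex(vectors)
first_hits = index.search(query)

with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "toy_index.json"
    index.dump(snapshot)
    restored = ToyRandomIndex.load(snapshot)
    restored_hits = restored.search(query)

rebuilt_hits = ToyRandomIndex(vectors).search(query)  # fresh build, new randomness

print("original:", first_hits)
print("restored:", restored_hits)  # same as original
print("rebuilt: ", rebuilt_hits)   # may or may not match
```

The restored index reuses the stored hyperplanes, so its answers match the original. The rebuilt index draws new ones and can return a different list for the same query.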

Mewral commented 1 year ago

Yes, I'll try, thx. I still have one more question. Assume Query-A's top-3 most relevant documents are [1, 2, 3] and Query-B's are [5, 6, 7]. In one run, Query-A returns the correct result [1, 2, 3] while Query-B returns the wrong result [5, 8, 7]. After I reload annlite without a dumped indexer, Query-A returns the wrong result [1, 4, 3] and Query-B returns the correct result [5, 6, 7]. They both change, so I haven't found a way to make them both right. Is this also an HNSW feature?
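
One way to decide which list is "correct" is to compare each approximate result against an exact brute-force ranking of the same embeddings. The sketch below assumes cosine similarity and uses made-up 2-d vectors. `exact_topk` and `recall_at_k` are hypothetical helper names, and none of this is AnnLite code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_topk(query, vectors, k=3):
    """Score every document: slow, but exact and deterministic."""
    ranked = sorted(vectors, key=lambda doc_id: (-cosine(query, vectors[doc_id]), doc_id))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Made-up embeddings for illustration only.
vectors = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.8, 0.3],
    4: [0.0, 1.0],
    5: [-1.0, 0.0],
}
query_a = [1.0, 0.0]

truth = exact_topk(query_a, vectors, k=3)
recall = recall_at_k([1, 4, 3], truth)  # approximate result quoted for Query-A
print(truth, recall)
```

If recall against the exact ranking is low for some queries, the usual remedy is to give the approximate search more effort through the index's build and search parameters. That trades speed for accuracy, and which parameters annlite exposes for this is not stated in this thread.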

Mewral commented 1 year ago

NVM, I checked the documents and solved it. Closing it now.