pinecone-io / pinecone-python-client

The Pinecone Python client
https://www.pinecone.io/docs
Apache License 2.0
294 stars, 79 forks

[Bug] Pinecone vector search by id returns incorrect vectors. #346

Open ayansengupta17 opened 5 months ago

ayansengupta17 commented 5 months ago

Is this a new bug in the Pinecone Python client?

Current Behavior

If you query a Pinecone index by vector ID with top_k=1, the returned vector sometimes has a different ID. If you set top_k > 1, the correct vector sometimes appears at a position after the first.

Expected Behavior

If I search using vector id, the whole point is to get the vector whose id matches the query. Then find other vectors with high similarity scores.

Steps To Reproduce

It's hard to provide reproducible steps, because it only happens sometimes. We see it happening a lot in our production environment, so I have attached some relevant screenshots from the UI instead.

Screenshot 2024-05-11 at 21 24 33

Check out more examples at https://community.pinecone.io/t/bug-pinecone-search-by-id-is-returning-incorrect-result/5554

Relevant log output

See https://community.pinecone.io/t/bug-pinecone-search-by-id-is-returning-incorrect-result/5554

Environment

- OS:
- Python:
- pinecone:

Additional Context

No response

zackproser commented 5 months ago

Hi @ayansengupta17,

Thank you for your post, and thank you for taking the time to get screenshots and file an issue on GitHub.

I’ve discussed this with the relevant teams to double-check, but this is actually not a bug!

Please see our guide on the Limitations of querying by ID to understand why this is happening.

If you want to ensure your results contain the vector you’re requesting by ID, you can use fetch instead, as outlined here.
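As a rough illustration of the fetch route, here is a minimal sketch. The helper name `fetch_values_by_id` is hypothetical, and it assumes `index` is a Pinecone `Index` object whose `fetch(ids=[...])` response exposes a `vectors` mapping from ID to a record with a `values` attribute; check this against the client version you have installed.

```python
from typing import Any, List, Optional


def fetch_values_by_id(index: Any, vector_id: str) -> Optional[List[float]]:
    """Return the stored values for vector_id, or None if the ID is not present.

    Assumption: `index` behaves like a Pinecone Index, i.e. index.fetch(ids=[...])
    returns an object with a `vectors` dict keyed by vector ID.
    """
    response = index.fetch(ids=[vector_id])
    record = response.vectors.get(vector_id)
    if record is None:
        return None
    return list(record.values)
```

Unlike a query by ID, this is a direct lookup, so it either returns the requested record or nothing; it does not return neighbours.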

I hope this helps!

Best, Zack

ayansengupta17 commented 5 months ago

@zackproser Thanks for pointing to the documentation. That was really helpful. I want to suggest two things here

likid1412 commented 4 months ago
  • When a user queries a vector by ID, the expected behaviour is to get that particular vector as the first hit, followed by its nearest neighbours as the other hits.

I think @ayansengupta17 is right: the vector should be the first hit when querying by its ID. Below is what I'm using in another vector database; it returns the queried ID as the first hit, as I expected.

image

ref: Vector Database, Doc ID-Based Similarity Search - SDK Reference - Documentation Center - Tencent Cloud
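With the Pinecone client, the "requested ID first" behaviour discussed in this thread can probably be approximated on the client side by combining fetch and query. The sketch below is only a workaround under stated assumptions, not something the library does for you: `query_with_id_first` is a hypothetical helper, and it assumes `index.fetch(ids=[...])` returns an object with a `vectors` dict and `index.query(vector=..., top_k=...)` returns an object whose `matches` have `id` and `score` attributes.

```python
from typing import Any, List, Optional, Tuple


def query_with_id_first(
    index: Any, vector_id: str, top_k: int = 5
) -> List[Tuple[str, Optional[float]]]:
    """Return up to top_k (id, score) pairs with vector_id always in first place.

    Assumption: `index` behaves like a Pinecone Index (fetch + query by vector).
    """
    # Direct lookup of the stored vector for the requested ID.
    fetched = index.fetch(ids=[vector_id])
    record = fetched.vectors.get(vector_id)
    if record is None:
        raise KeyError(f"vector id {vector_id!r} not found in index")

    # Similarity search using the fetched values as the query vector.
    response = index.query(vector=list(record.values), top_k=top_k)
    hits = [(match.id, match.score) for match in response.matches]

    # Reuse the requested ID's own score if the search returned it;
    # otherwise pin it with score None, since its score is unknown here.
    own = next((hit for hit in hits if hit[0] == vector_id), (vector_id, None))
    others = [hit for hit in hits if hit[0] != vector_id]
    return ([own] + others)[:top_k]
```

This costs one extra round trip per query, and the pinned entry has no score when the search itself did not return it.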