huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Documentation: wiki_dpr Dataset has no metric_type for Faiss Index #6011

Closed YichiRockyZhang closed 1 year ago

YichiRockyZhang commented 1 year ago

Describe the bug

After loading wiki_dpr using:

from datasets import load_dataset

ds = load_dataset(path='wiki_dpr', name='psgs_w100.multiset.compressed', split='train')
print(ds.get_index("embeddings").metric_type)  # prints "None" because the value is unset

the index does not have a defined metric_type. This is an issue because I do not know how the scores are being computed for get_nearest_examples().

Steps to reproduce the bug

System: Python 3.9.16, Transformers 4.30.2, WSL

Load the dataset and check the index's metric type:

from datasets import load_dataset

ds = load_dataset(path='wiki_dpr', name='psgs_w100.multiset.compressed', split='train')
print(ds.get_index("embeddings").metric_type)  # None

Then retrieve nearest examples and compare the reported scores against manually computed metrics:

import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

def encode_question(query, tokenizer=tokenizer, encoder=encoder):
    inputs = tokenizer(query, return_tensors='pt')
    question_embedding = encoder(**inputs)[0].detach().numpy()
    return question_embedding

def get_knn(query, k=5, tokenizer=tokenizer, encoder=encoder, verbose=False):
    enc_question = encode_question(query, tokenizer, encoder)
    topk_results = ds.get_nearest_examples(index_name='embeddings',
                                           query=enc_question,
                                           k=k)

    # compare the built-in score against a manual inner product
    # and squared L2 distance for the top result
    a = torch.tensor(enc_question[0]).reshape(768)
    b = torch.tensor(topk_results.examples['embeddings'][0])
    print(a.shape, b.shape)
    print(torch.dot(a, b))        # inner product
    print((a - b).pow(2).sum())   # squared L2 distance

    return topk_results

The FAISS documentation suggests the metric is usually L2 distance (without the square root) or the inner product. I compute both for the sample query:

query = """ it catapulted into popular culture along with a line of action figures and other toys by Bandai.[2] By 2001, the media franchise had generated over $6 billion in toy sales.
Despite initial criticism that its action violence targeted child audiences, the franchise has been commercially successful."""
get_knn(query,k=5)

Here, I get a dot product of 80.6020 and a squared L2 distance of 77.6616, while the index reports:

NearestExamplesResults(scores=array([76.20431 , 75.312416, 74.945404, 74.866394, 74.68506 ],
      dtype=float32), examples={'id': ['3081096', '2004811', '8908258', '9594124', '286575'], 'text': ['actors, resulting in the "Power Rangers" franchise which has continued since then into sequel TV series (with "Power Rangers Beast Morphers" set to premiere in 2019), comic books, video games, and three feature films, with a further cinematic universe planned. Following from the success of "Power Rangers", Saban acquired the rights to more of Toei\'s library, creating "VR Troopers" and "Big Bad Beetleborgs" from several Metal Hero Series shows and "Masked Rider" from Kamen Rider Series footage. DIC Entertainment joined this boom by acquiring the rights to "Gridman the Hyper Agent" and turning it into "Superhuman Samurai Syber-Squad". In 2002,', 

Querying with k=1 shows that a higher score means a better match, so the metric should not be L2 distance. However, my manually computed inner product (80.6) does not match the reported score (76.2). Perhaps this is because I am using the compressed embeddings?

Expected behavior

from datasets import load_dataset

ds = load_dataset(path='wiki_dpr', name='psgs_w100.multiset.compressed', split='train')
print(ds.get_index("embeddings").metric_type)  # METRIC_INNER_PRODUCT

Environment info

mariosasko commented 1 year ago

Hi! You can do ds.get_index("embeddings").faiss_index.metric_type to get the metric type and then match the result with the FAISS metric enum (should be L2).
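The `metric_type` attribute is an integer. As a small sketch of how to make it readable (assumption: the values follow the `faiss.METRIC_*` enum, where `METRIC_INNER_PRODUCT` is 0 and `METRIC_L2` is 1; the `FAISS_METRICS` dict and `metric_name` helper below are illustrative, not part of any library):

```python
# Assumed mapping of faiss_index.metric_type integers to enum names
# (in FAISS, METRIC_INNER_PRODUCT == 0 and METRIC_L2 == 1).
FAISS_METRICS = {0: "METRIC_INNER_PRODUCT", 1: "METRIC_L2"}

def metric_name(metric_type: int) -> str:
    """Translate a faiss_index.metric_type integer into a readable name."""
    return FAISS_METRICS.get(metric_type, f"unknown ({metric_type})")

# usage, assuming ds is the loaded wiki_dpr split:
# print(metric_name(ds.get_index("embeddings").faiss_index.metric_type))
print(metric_name(0))  # METRIC_INNER_PRODUCT
```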

YichiRockyZhang commented 1 year ago

Ah! Thank you for pointing this out. FYI: the enum indicates it's using the inner product. Using torch.inner or torch.dot still produces a discrepancy compared to the built-in score. I think this is because of the compression/quantization that occurs with the FAISS index.
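A toy illustration of why compression causes the discrepancy (this uses crude scalar quantization, not the actual product quantizer FAISS applies to the compressed wiki_dpr index, so it only shows the effect in spirit): scores computed against lossily reconstructed vectors drift from the exact inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=768).astype(np.float32)   # stand-in for a query embedding
p = rng.normal(size=768).astype(np.float32)   # stand-in for a passage embedding

# crude 8-bit scalar quantization of the stored passage vector,
# loosely analogous to the lossy compression inside the FAISS index
scale = np.abs(p).max() / 127
p_quantized = np.round(p / scale) * scale

exact = float(q @ p)             # score against the original embedding
approx = float(q @ p_quantized)  # score against the compressed embedding
print(exact, approx, abs(exact - approx))  # the two scores differ slightly
```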