naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Zero-dimension query embedding #31

Closed adri1wald closed 1 year ago

adri1wald commented 1 year ago

In the notebook I made some modifications and got back a zero-dimensional embedding. Specifically, I wanted to see the BOW representation of a quoted search query using the efficient-splade models. Is it expected for the model to sometimes return zero-dimensional embeddings? Without the quotes it generates the expected representation.

import torch
from transformers import AutoTokenizer
from splade.models.transformer_rep import Splade

model_type_or_dir = "naver/efficient-splade-V-large-query"
q_model_type_or_dir = "naver/efficient-splade-V-large-doc"

# loading model and tokenizer

model = Splade(model_type_or_dir, q_model_type_or_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(q_model_type_or_dir)
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}

# the quoted search query that produces the zero-dimensional embedding

query = '"a big fat potato"'

# now compute the query representation
with torch.no_grad():
    inputs = tokenizer(query, return_tensors="pt")
    print(inputs)
    query_rep = model(q_kwargs=inputs)["q_rep"].squeeze()  # (sparse) query rep in voc space, shape (30522,)

# get the number of non-zero dimensions in the rep:
col = torch.nonzero(query_rep).squeeze(1).cpu().tolist()  # squeeze(1) keeps a list even for a single non-zero dim
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = query_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)
cadurosar commented 1 year ago

This is definitely not the expected behavior, but you have also inverted the document and query models; it should be

model_type_or_dir = "naver/efficient-splade-V-large-doc"
q_model_type_or_dir = "naver/efficient-splade-V-large-query"

q_model_type_or_dir refers to the query encoder, and the first argument (the default) to the document encoder. Properly using the query encoder should fix the problem.
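As a quick sanity check (a sketch reusing the tokenizer and query from the snippet above, with the two paths swapped):

model = Splade("naver/efficient-splade-V-large-doc",
               "naver/efficient-splade-V-large-query", agg="max")
model.eval()
with torch.no_grad():
    query_rep = model(q_kwargs=tokenizer(query, return_tensors="pt"))["q_rep"].squeeze()
print("non-zero dims:", int(torch.count_nonzero(query_rep)))  # should now be > 0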

That being said, I would not expect the document encoder to remove all values from this sequence. Note that SPLADE can trim some documents entirely if it considers that they don't have significant content, but this is not an example I would expect it to remove.
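Since pruning is possible by design, downstream code may still want to guard against an all-zero representation. A minimal sketch, using a hypothetical encode_query helper:

def encode_query(model, tokenizer, query):
    """Encode a query with SPLADE; return None if every term was pruned."""
    with torch.no_grad():
        inputs = tokenizer(query, return_tensors="pt")
        rep = model(q_kwargs=inputs)["q_rep"].squeeze()
    if torch.count_nonzero(rep) == 0:
        return None  # caller can fall back to, e.g., plain lexical matching
    return rep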

adri1wald commented 1 year ago

Hey @cadurosar, cheers for pointing that out! I haven't encountered any zero-dim embeddings since.