texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Is there a difference between query and passage encoder? #34

Closed gzerveas closed 1 year ago

gzerveas commented 2 years ago

I understand the need to keep them separate, and that they have different weights after training on MS MARCO, but is there a difference in how they handle the input sequence? Specifically, when calling model = AutoModel.from_pretrained('co-condenser-marco'), what do we get: a query encoder or a passage encoder? Or both, so that we need to pass a special token type ID depending on whether a query or a passage is given as input? What does the option --encode_is_qry in python -m tevatron.driver.encode practically do, compared to omitting it?

MXueguang commented 2 years ago

Hi @gzerveas

  1. Some dense retriever models use untied parameters, where the query and passage encoders do not share weights, e.g. DPR.
  2. Other models use tied parameters, where the query and passage encoders share the same weights, e.g. co-condenser-marco.

model = AutoModel.from_pretrained('co-condenser-marco') gives you the backbone that serves as both the query encoder and the passage encoder, since this model uses tied parameters.

The --encode_is_qry flag determines a) which encoder is used, i.e. the query encoder in the untied-parameters case, and b) the text input length used for queries, which differs from the one used for passages.
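To make the tied vs. untied distinction concrete, here is a minimal sketch (the class names are hypothetical, not Tevatron's actual implementation):

import torch.nn as nn
from transformers import AutoModel

class UntiedBiEncoder(nn.Module):
    # DPR-style: query and passage encoders have separate weights.
    def __init__(self, model_name):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(model_name)
        self.passage_encoder = AutoModel.from_pretrained(model_name)

class TiedBiEncoder(nn.Module):
    # co-condenser-marco style: a single backbone serves both roles.
    def __init__(self, model_name):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.query_encoder = self.encoder    # same object, shared weights
        self.passage_encoder = self.encoder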

gzerveas commented 2 years ago

Hi @MXueguang , thank you very much for your reply, and for sharing your nice code!

I am trying to do a simple evaluation of coCondenser: I use a coCondenser encoder to encode the MS MARCO passage collection, then encode some queries (e.g. the ones in queries.dev.small.tsv), compute the inner product between these query embeddings and the collection embeddings, and sort these scores for each query to obtain a ranking.
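For concreteness, the scoring and ranking step described above can be sketched as follows (the embedding matrices here are random placeholders standing in for the actual encoded queries and passages):

import numpy as np

q_embs = np.random.rand(5, 768)        # placeholder: 5 encoded queries
p_embs = np.random.rand(100, 768)      # placeholder: 100 encoded passages

scores = q_embs @ p_embs.T             # inner product of every query with every passage
ranking = np.argsort(-scores, axis=1)  # passage indices sorted by descending score
top10 = ranking[:, :10]                # candidates used to compute MRR@10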

When initializing the coCondenser model using model = AutoModel.from_pretrained('co-condenser-marco'), I get the following warning: Some weights of BertModel were not initialized from the model checkpoint at Luyu/co-condenser-marco and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Is this okay/expected?

To obtain the embedding of the query/passage, I simply select the first embedding vector of the model's output representation (i.e. the embedding corresponding to [CLS]), like this:

output = model(**batch)
embeddings = output[0][:, 0, :]  # output[0] is last_hidden_state; [:, 0, :] selects the [CLS] token embedding
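For reference, a self-contained version of this CLS-pooling step might look like the following (a sketch; the query text is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Luyu/co-condenser-marco')
model = AutoModel.from_pretrained('Luyu/co-condenser-marco')
model.eval()

batch = tokenizer(['what is a dense retriever?'], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    output = model(**batch)
embeddings = output.last_hidden_state[:, 0, :]  # one [CLS] vector per input sequence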

The problem is that I only get an MRR@10 of about 0.14, instead of ~0.4, and I am trying to understand why. Is there something that I am doing wrong when using the pretrained coCondenser model?

MXueguang commented 2 years ago

Hi @gzerveas, the warning message is expected, as we don't need the pooler from BERT, and the way you get the [CLS] embeddings should be correct.

Could you double-check which checkpoint you are using? I think you should use Luyu/co-condenser-marco-retriever instead of Luyu/co-condenser-marco.
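That is, the same loading call with the fine-tuned retriever checkpoint:

from transformers import AutoModel

model = AutoModel.from_pretrained('Luyu/co-condenser-marco-retriever')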

gzerveas commented 2 years ago

Thank you very much for the prompt reply! I see, Luyu/co-condenser-marco is probably the model pre-trained on the MS MARCO collection through MLM and the contrastive objective, and -retriever is the model actually fine-tuned for retrieval.

Using Luyu/co-condenser-marco-retriever I was able to get 0.366 MRR@10 on MS MARCO dev.small, which is great. However, this is exactly the number reported for the Condenser model (in the Condenser and coCondenser papers), while the MRR@10 reported for the coCondenser model is 0.382. Are you certain that Luyu/co-condenser-marco-retriever is a checkpoint of coCondenser and not Condenser?

MXueguang commented 2 years ago

Are you certain that Luyu/co-condenser-marco-retriever is a checkpoint of coCondenser and not Condenser?

Yes.

Does the corpus you use align with the document here? https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco The corpus in Tevatron has a title field.

gzerveas commented 2 years ago

Thank you, I was using the official MS MARCO collection.tsv and wasn't aware that RocketQA and (co)Condenser used a corpus with a title field. Using this new corpus, performance improved to 0.369 MRR@10 on MS MARCO dev.small. This is still a bit lower than the 0.382 reported in the paper for coCondenser. Would you have any idea about what could explain the difference?

E.g., when tokenizing the documents of this enhanced corpus, are you simply concatenating the title and main body tokens, or are you using a separator token in between?

MXueguang commented 2 years ago

I think they use [SEP] to separate title and text during training and encoding: https://github.com/texttron/tevatron/blob/adf5ce45612332797931569d51cc5bcd8c1ac878/src/tevatron/preprocessor/preprocessor_tsv.py#L92
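One way to reproduce this outside Tevatron is to pass title and body to a HuggingFace tokenizer as a text pair, which inserts [SEP] between them automatically (a sketch; the linked preprocessor may differ in detail, and the title/text values are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Luyu/co-condenser-marco-retriever')
title = 'Example Title'
text = 'Example passage body.'

# Produces: [CLS] title tokens [SEP] text tokens [SEP]
encoded = tokenizer(title, text, truncation='only_second', max_length=128, return_tensors='pt')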

gzerveas commented 2 years ago

Thank you very much for confirming this. I tested tokenizing with the separator token in between, and performance indeed jumped to 0.3813 MRR@10, which is remarkable for such a minor change.

In any case, this is certainly close enough to the reported 0.382, and I find it very encouraging that it was independently verified using a separate code base. Thank you very much for your prompt responses; I am certain that this level of transparency and support helps advance research!