Closed gzerveas closed 1 year ago
Hi @gzerveas
model = AutoModel.from_pretrained('co-condenser-marco')
will load the model backbone used for both the query encoder and the passage encoder, since the two share (tie) parameters.
The --encode_is_qry
flag decides a) whether to use the query encoder in the untied-parameters case, and b) the input truncation length for the query encoder.
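If it helps, here is a tiny pure-Python sketch of the two effects of --encode_is_qry described above. The length constants are assumptions for illustration, not Tevatron's actual defaults:

```python
# Illustrative truncation lengths -- check Tevatron's argument defaults
# before relying on these exact values.
QUERY_MAX_LEN = 32
PASSAGE_MAX_LEN = 128

def encoder_config(untied: bool, is_query: bool):
    """Return which encoder weights are used and the truncation length.

    With tied parameters there is a single shared backbone, so
    --encode_is_qry only changes the input truncation length; with
    untied parameters it additionally selects the query-side weights.
    """
    which = ('query' if is_query else 'passage') if untied else 'shared'
    max_len = QUERY_MAX_LEN if is_query else PASSAGE_MAX_LEN
    return which, max_len
```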
Hi @MXueguang , thank you very much for your reply, and for sharing your nice code!
I am trying to do a simple evaluation of coCondenser: I use a coCondenser encoder to encode the MS MARCO passage collection, then encode some queries (e.g. the ones in queries.dev.small.tsv), compute the inner product between these query embeddings and the collection embeddings, and sort these scores for each query to obtain a ranking.
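For concreteness, here is a minimal brute-force sketch of that evaluation loop, assuming the query and passage embeddings are already computed as NumPy arrays (the qrels mapping below is a made-up toy example, not the real MS MARCO judgments):

```python
import numpy as np

def mrr_at_10(query_embs, passage_embs, qrels):
    """Brute-force dense retrieval: score every passage against every
    query by inner product, rank, and compute MRR@10.

    qrels maps query index -> set of relevant passage indices.
    """
    scores = query_embs @ passage_embs.T          # (n_queries, n_passages)
    ranked = np.argsort(-scores, axis=1)[:, :10]  # top-10 passage ids per query
    rr_sum = 0.0
    for qi, top10 in enumerate(ranked):
        for rank, pid in enumerate(top10, start=1):
            if pid in qrels.get(qi, set()):
                rr_sum += 1.0 / rank
                break
    return rr_sum / len(query_embs)
```

This is only feasible for small-scale checks; for the full 8.8M-passage corpus one would batch the matrix product or use an ANN index such as FAISS.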
When initializing the coCondenser model using model = AutoModel.from_pretrained('co-condenser-marco')
, I get the following warning: Some weights of BertModel were not initialized from the model checkpoint at Luyu/co-condenser-marco and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Is this okay/expected?
To obtain the embedding of the query/passage, I simply select the first embedding vector of the model's output representation (i.e. the embedding corresponding to [CLS]
), like this:
output = model(**batch)
embeddings = output[0][:, 0, :]  # output[0] is the last_hidden_state in the HuggingFace output format; take the [CLS] position
The problem is that I only get an MRR@10 of ~0.14, instead of ~0.4, and I am trying to understand why. Is there something I am doing wrong when using the pretrained coCondenser model?
Hi @gzerveas, the warning message is expected, as we don't need the pooler from BERT, and the way you get the CLS embeddings should be correct.
Could you double check which checkpoint you are using?
I think you should use Luyu/co-condenser-marco-retriever
instead of Luyu/co-condenser-marco
Thank you very much for the prompt reply! I see, Luyu/co-condenser-marco
is probably the model pre-trained on the MS MARCO collection through MLM and the contrastive objective, and -retriever
is the model actually fine-tuned for retrieval.
Using Luyu/co-condenser-marco-retriever
I was able to get 0.366 MRR@10 on MS MARCO dev.small, which is great. However, this is exactly the number reported for the Condenser model (in the Condenser and coCondenser papers), while the MRR@10 reported for coCondenser is 0.382. Are you certain that Luyu/co-condenser-marco-retriever
is a checkpoint of coCondenser and not Condenser?
Are you certain that Luyu/co-condenser-marco-retriever is a checkpoint of coCondenser and not Condenser?
Yes.
Does the corpus you use align with the document here? https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco The corpus in Tevatron has a title field.
Thank you, I was using the official MS MARCO collection.tsv
, and wasn't aware that RocketQA and (co)Condenser used a corpus with a title field. Using this new corpus, I measured an improved performance of 0.369 MRR@10 on MS MARCO dev.small. This is still a bit lower than the 0.382 reported in the paper for coCondenser. Would you have any idea what could explain the difference?
E.g., when tokenizing the documents of this enhanced corpus, are you simply concatenating the title and main body tokens, or are you using a separator token in between?
I think they use [SEP] to separate title and text during training and encoding: https://github.com/texttron/tevatron/blob/adf5ce45612332797931569d51cc5bcd8c1ac878/src/tevatron/preprocessor/preprocessor_tsv.py#L92
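For reference, a string-level sketch of that join (build_passage is a hypothetical helper for illustration; the linked preprocessor works at the token level, where encoding title and body as a pair makes the tokenizer insert [SEP] between them):

```python
def build_passage(title: str, text: str) -> str:
    # Hypothetical helper: join title and body with a literal [SEP],
    # mirroring what pair-encoding (tokenizer(title, body)) produces.
    # A missing/empty title falls back to the body alone.
    title = title.strip()
    return f"{title} [SEP] {text}" if title else text
```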
Thank you very much for confirming this. I tested tokenizing with the separator token in between, and performance indeed jumped to 0.3813 MRR@10, which I find remarkable for such a minor change.
In any case, this is certainly close enough to the reported 0.382, and I find it very encouraging that the result was independently verified using a separate code base. Thank you very much for your prompt responses; I am certain that this level of transparency and support helps advance research!
I understand the need to keep them separate, and that they have different weights after training on MS MARCO, but is there a difference in how they handle the input sequence? Specifically, when calling
model = AutoModel.from_pretrained('co-condenser-marco')
, what do we get: a query encoder or a passage encoder? Or both, in which case do we need to pass a special token type ID depending on whether a query or a passage is given as input? What does the option --encode_is_qry of python -m tevatron.driver.encode
practically do, compared to omitting it?