qqaatw / pytorch-realm-orqa

PyTorch reimplementation of REALM and ORQA
Apache License 2.0

Official EM and Finetuning program #6

Open Hannibal046 opened 2 years ago

Hannibal046 commented 2 years ago

Hi, thanks for the great work! It really helps me a lot.

I am curious about the difference between EM and official EM. Also, what remains to be improved in run_finetune.py, considering its experimental flag? Thanks so much!

qqaatw commented 2 years ago

Hi,

Thanks for your interest in this project.

The difference between EM and official EM is that EM requires the predicted answer span to be exactly the same as the target answer span, which is a span-level (logit) comparison. Official EM, on the other hand, compares a normalized answer text (e.g. with whitespace stripped) against the target text, which is a text-level comparison and is comparable with other QA models; that's why it's called "official".
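For illustration, here is a minimal sketch of the text-level comparison, using a SQuAD-style normalization (lowercasing, stripping punctuation, articles, and extra whitespace); the exact normalization used in this repo may differ slightly:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation, articles, and extra whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def official_em(prediction, target):
    # Text-level comparison: exact match after normalization.
    return normalize_answer(prediction) == normalize_answer(target)

print(official_em(" The Alan Turing ", "alan turing"))  # True
```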

For the "experimental" notation, I used it because I had not fully reproduced the results on the paper through that fine-tuning script, but now the results can be fully reproduced so you can just ignore that notation.

The difference between the original TF implementation and this project is that this project lets you add additional documents for retrieval during fine-tuning or prediction, although, by design, the embeddings of these additional documents are not further updated during fine-tuning.
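For concreteness, a rough sketch of what encoding extra documents and appending them to the retrieval index could look like with the Hugging Face REALM classes; the checkpoint name and the block_emb.pt path are illustrative assumptions, not this repo's actual API:

```python
import torch
from transformers import RealmEmbedder, RealmTokenizer

checkpoint = "google/realm-cc-news-pretrained-embedder"  # assumed checkpoint name
tokenizer = RealmTokenizer.from_pretrained(checkpoint)
embedder = RealmEmbedder.from_pretrained(checkpoint)

new_docs = ["Alan Turing was a pioneer of modern computer science."]
inputs = tokenizer(new_docs, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    new_embeds = embedder(**inputs).projected_score  # (num_docs, proj_size)

# block_emb holds the precomputed embeddings of the original evidence blocks;
# appending the new rows makes the extra documents retrievable.
block_emb = torch.load("block_emb.pt")               # hypothetical path
block_emb = torch.cat([block_emb, new_embeds], dim=0)
```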

Hannibal046 commented 2 years ago

OK. Thanks for your quick and detailed reply!

After downloading the data and code, I have the following questions:

  1. In this project, we can easily add additional documents for retrieval during fine-tuning or prediction. But when executing python predictor.py --question "Who is the pioneer in modern computer science?" with additional documents, I noticed that the additional document embeddings are obtained from the fine-tuned realm_nq_embedder. Considering that the query embedder and document embedder are tied, wouldn't it be more reasonable to encode the additional documents with cc_news_pretrained_embedder? After all, the embedder changes during fine-tuning while the original document embeddings stay fixed.

  2. How can I accelerate predictor.py by using CUDA?

  3. When executing checkpoint_converter.py, I got the following result. Does this mean the conversion from TF to PT was successful?

[screenshot of checkpoint_converter.py output]
  4. When benchmarking the NQ dataset, I have to download nearly 100 GB of data from Hugging Face while most of it goes unused. Is there a better way?

qqaatw commented 2 years ago

  1. TL;DR: Not really, though I didn't compare the performance between them.

Although doc_embed is frozen during fine-tuning in the original TF implementation for simplicity, the paper actually mentions that asynchronous refreshes can be used for both pre-training and fine-tuning. So in this case, using realm_nq_embedder is similar to retrieving embeddings from an updated/non-stale embedder θ (though these additional documents are definitely not marginalized over during fine-tuning); on the other hand, getting embeddings from cc_news_pretrained_embedder basically means using a stale embedder θ, because it is not further updated along with fine-tuning.

A section quoted from the REALM paper:

While asynchronous refreshes can be used for both pre-training and fine-tuning, in our experiments we only use it for pre-training. For fine-tuning, we just build the MIPS index once (using the pre-trained θ) for simplicity and do not update Embed_doc. Note that we still fine-tune Embed_input, so the retrieval function is still updated from the query side.

The reason why doc_embed is frozen:

This works because pre-training already yields a good Embed_doc function. However, it is possible that refreshing the index would further improve performance.

  2. predictor.py doesn't provide a device option, so you'll need to modify it yourself (see the sketch after this list).

  3. Sure, that's normal, because the block_emb tensor is very large.

  4. I'm afraid not. You could dive into Hugging Face's datasets library and see whether there is a way to circumvent this.
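
Regarding point 2, here is a rough sketch of the kind of manual modification meant there, assuming the Hugging Face RealmForOpenQA classes and the google/realm-orqa-nq-openqa checkpoint rather than this repo's actual predictor.py; depending on where block_emb and the retriever outputs live, additional device handling may be needed:

```python
import torch
from transformers import RealmForOpenQA, RealmRetriever, RealmTokenizer

checkpoint = "google/realm-orqa-nq-openqa"  # assumed checkpoint name
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

retriever = RealmRetriever.from_pretrained(checkpoint)
tokenizer = RealmTokenizer.from_pretrained(checkpoint)
# .to(device) also moves the large block_emb buffer, so the GPU must have
# enough memory to hold it.
model = RealmForOpenQA.from_pretrained(checkpoint, retriever=retriever).to(device)
model.eval()

question = "Who is the pioneer in modern computer science?"
question_ids = tokenizer([question], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**question_ids)

print(tokenizer.decode(outputs.predicted_answer_ids))
```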

Hannibal046 commented 2 years ago

Hi, thanks for your reply.

As for the fixed document embeddings, one crucial point is that a newly added document should be encoded in the same way as the documents it is concatenated with (the Wiki data). So, from my personal view, cc_news_pretrained_embedder is more appropriate here. Also, in the fine-tuning phase the embedder is only updated from the query side rather than from the document side.

And as for why asynchronous index refresh is not applied in the fine-tuning phase, I think it is a trade-off between computational cost and the performance improvement (as is the case for the RAG model).

qqaatw commented 2 years ago

Yes, I believe it was a trade-off.

Unless evidence is provided that embeddings from the pre-trained embedder outperform most of the time, IMO I would not judge it more "proper" just because the resulting embeddings of the original Wiki data and the newly added documents are aligned.

qqaatw commented 2 years ago

Also note that the aforementioned embeddings are only used to retrieve the top-K relevant documents; the reader still encodes those documents with its own representations, so which embedder is used for retrieval doesn't matter too much, I think.
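
To make that concrete, here is a minimal sketch of the retrieval step the embeddings are used for (shapes and names are illustrative, not this repo's code): the document embeddings only rank the evidence blocks, and the reader then consumes the raw text of the top-K blocks.

```python
import torch

# Illustrative shapes: 13k evidence blocks with 128-d projected embeddings.
num_blocks, proj_size, k = 13_000, 128, 5
block_emb = torch.randn(num_blocks, proj_size)   # precomputed document embeddings
query_emb = torch.randn(1, proj_size)            # query-side embedding (fine-tuned)

# Maximum inner product search: score every block and keep the top-K.
retrieval_scores = query_emb @ block_emb.T       # (1, num_blocks)
topk_scores, topk_ids = retrieval_scores.topk(k, dim=-1)

# The reader then re-encodes only the *text* of these top-K blocks with its own
# encoder; block_emb itself never reaches the reader.
print(topk_ids.tolist())
```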