qqaatw / pytorch-realm-orqa

PyTorch reimplementation of REALM and ORQA
Apache License 2.0

Parameters of the retriever in fine-tuning #9

Open catalwaysright opened 2 years ago

catalwaysright commented 2 years ago

Hi! I am wondering why the retriever is frozen during fine-tuning. I think the retriever would learn more if it were also updated in fine-tuning. I am not very familiar with TensorFlow. Is it possible to update the parameters of the retriever during fine-tuning with this repository? How?

qqaatw commented 2 years ago

See #5 #6, and see the papers.

catalwaysright commented 2 years ago

> See #5 #6, and see the papers.

Thanks for your reply! I have checked the issues and the paper. I just want to double-check that I have it right: the parameters of the query embedder are actually updated during fine-tuning, but we just don't refresh the document embeddings (the index) during fine-tuning. Thus, the embedding of the same question will change as the query embedder is optimized, and we may get a different top-k of relevant documents over the course of fine-tuning even for the same input question.

qqaatw commented 2 years ago

Indeed, that is how optimization works, isn’t it?

We could migrate the async index refresh here, but it requires a lot of work due to its complexity.
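
For intuition, here is a rough conceptual sketch (toy sizes and made-up names, not this repository's code) of why the top-k can change even though the document side stays frozen: only the query embedder receives gradients, so the same question maps to a new query vector after every update, and the inner-product search over the fixed block_emb matrix can then return a different top-k.

```python
import torch

num_blocks, proj_dim, hidden_dim = 1000, 128, 768       # toy sizes; the real corpus has ~13M blocks
block_emb = torch.randn(num_blocks, proj_dim)            # document embeddings, precomputed offline
block_emb.requires_grad_(False)                          # frozen during fine-tuning

query_embedder = torch.nn.Linear(hidden_dim, proj_dim)   # stand-in for the trainable query encoder

def retrieve_top_k(question_repr, k=5):
    query_vec = query_embedder(question_repr)            # changes as fine-tuning proceeds
    scores = block_emb @ query_vec                        # inner-product (MIPS) over frozen docs
    return torch.topk(scores, k).indices
```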

catalwaysright commented 2 years ago

Another question: I downloaded the natural_questions dataset locally, but when I tried to load it using the load function provided in data.py, it showed Dataset path currently not supported., simply because I passed a local OS path. How can I fix this and load the local natural_questions dataset?

qqaatw commented 2 years ago

How did you download NQ?

catalwaysright commented 2 years ago

> How did you download NQ?

By using gsutil -m cp -R gs://natural_questions/v1.0 <path to your data directory>, and the structure looks like this: [screenshot: https://user-images.githubusercontent.com/60195620/159146888-6d2d70eb-322d-4b17-bafd-5df1979d36c1.png]

qqaatw commented 2 years ago

The preferred way to download it is with huggingface's datasets library, which provides many utilities such as caching, mapping, and filtering. The data source that library uses is also Google's release.

If you instead want to handle the files yourself, you'll need to write a dataset loading function in data.py that returns the same format as load_nq().
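
For reference, a minimal sketch of the datasets route (split and field names follow the hub schema for natural_questions; the download is large and cache_dir is optional):

```python
from datasets import load_dataset

# Downloads and caches Natural Questions (sourced from Google's release).
nq = load_dataset("natural_questions", cache_dir="/path/to/cache")

print(nq)                                  # train / validation splits
print(nq["train"][0]["question"]["text"])  # a sample question
```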

catalwaysright commented 2 years ago

Thank you so much for answering my questions so patiently! I encountered another problem when running run_finetune.py with exactly the same args as your experiment: I got CUDA out of memory, like this: [screenshot of the CUDA out-of-memory error]. I am running it on one V100 GPU with 15 GB of memory and I set the batch size to 1. Is that still not enough to run this? How could I reduce the memory consumption and reproduce the experiment?

qqaatw commented 2 years ago

Hi, fine-tuning with the default configuration can be run on a single RTX 2080 Ti, so a V100 with 15 GB of memory is definitely sufficient. You may find the causes/solutions by googling the error message.

@catalwaysright Hey, sorry, I forgot to mention this: if you installed transformers from master, you may need to add model.block_embedding_to("cpu") after sending the model to the GPU, because the latest REALM patch by default sends the block_emb tensor, which occupies an appreciable amount of GPU memory, to the GPU along with model.cuda().
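
Something along these lines, i.e. right after the model is moved to the GPU (a placement sketch, not the exact run_finetune.py code; the checkpoint name is just the public one from the docs):

```python
import torch
from transformers import RealmForOpenQA, RealmRetriever

retriever = RealmRetriever.from_pretrained("google/realm-orqa-nq-openqa")
model = RealmForOpenQA.from_pretrained("google/realm-orqa-nq-openqa", retriever=retriever)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)                  # this also moves the large block_emb tensor to the GPU
model.block_embedding_to("cpu")   # move block_emb back to CPU to free GPU memory
```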

catalwaysright commented 2 years ago

Sorry for bothering you again. Could you show the specific place where I should add model.block_embedding_to("cpu")? When I add it after sending the model to the GPU in run_finetune.py, it shows AttributeError: 'RealmForOpenQA' object has no attribute 'block_embedding_to'. Thanks!

qqaatw commented 2 years ago

Hi, which version of transformers are you using? You can install transformers==4.18.0, where the latest REALM patch is included.

https://huggingface.co/docs/transformers/model_doc/realm#transformers.RealmForOpenQA.block_embedding_to
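
A quick way to check that the installed version actually has the patch:

```python
import transformers
from transformers import RealmForOpenQA

print(transformers.__version__)                       # should be 4.18.0 or newer
print(hasattr(RealmForOpenQA, "block_embedding_to"))  # True once the REALM patch is included
```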

catalwaysright commented 2 years ago

I tried your approach and it still shows CUDA out of memory, but I figured out that this may be normal, because there is only 8 GB of memory left on the V100, which is not enough to load and optimize the whole model. How much memory did the run use on your RTX 2080 Ti?

qqaatw commented 2 years ago

Please reserve at least as much GPU memory as an RTX 2080 Ti has (11 GB); that is the minimum requirement.

catalwaysright commented 2 years ago

Hi! I am now modifying this model to use multiple retrievers and trying to train it. However, during training I found that the retriever loss and the reader loss are 0.0 most of the time, and the reader loss was also often 0.0 when I was training the original model. Why are there so many 0.0 values? Is this normal at the beginning, or are there other tricks to training this model?

qqaatw commented 2 years ago

If the ground truth is not present in any retrieved context or in any predicted answer span, the retriever loss or reader loss, respectively, is set to zero to prevent ineffective updates.

https://github.com/huggingface/transformers/blob/v4.19.2/src/transformers/models/realm/modeling_realm.py#L1662-L1663

This is likely to happen when you train the model from scratch without loading a pre-trained checkpoint such as cc_news, or without a proper warm-up.
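
Schematically, the zeroing works like this (a simplified sketch of that snippet; names are illustrative, not the exact transformers code):

```python
import torch

def mask_losses(retriever_loss, reader_loss, retriever_correct, reader_correct):
    # retriever_correct / reader_correct: boolean tensors marking whether any
    # retrieved block / any predicted span actually contains the gold answer.
    # If none does, multiply the corresponding loss by 0 so the step produces
    # no gradient for that objective.
    retriever_loss = retriever_loss * torch.any(retriever_correct).float()
    reader_loss = reader_loss * torch.any(reader_correct).float()
    return retriever_loss, reader_loss
```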

catalwaysright commented 2 years ago

Oh, I see! So it will be fine after more steps, right?

qqaatw commented 2 years ago

For training from scratch, you should follow the steps in the REALM/ORQA papers to pre-train/warm up your model; otherwise, the model is unlikely to improve further. If you are fine-tuning from cc_news or another proper pre-trained checkpoint, then you can keep training and watch whether the losses improve.