Hi!
We didn't modify the dictionary. Since we freeze the embeddings and the final linear projection, its size should have little impact on memory.
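To make this concrete, here is a toy PyTorch sketch of the idea (simplified, with illustrative module names, not the actual model code): once the embedding and output-projection weights are frozen, they still occupy memory for their parameters, but they need no gradients or optimizer state, so a 250k-entry vocabulary adds relatively little to training memory.

```python
import torch.nn as nn

# Toy stand-in for a decoder with a very large vocabulary
# (illustrative names; not the actual fairseq modules).
class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=250_000, dim=1024):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.output_projection = nn.Linear(dim, vocab_size, bias=False)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(12)])

model = ToyDecoder()

# Freeze the embeddings and the final projection: no gradients or Adam
# statistics are kept for them, only the weights themselves.
for name, p in model.named_parameters():
    if name.startswith(("embed_tokens", "output_projection")):
        p.requires_grad = False

trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"num. trained: {trained:,} / total: {total:,}")
```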
When we were working on this project, we used smaller GPUs than yours (2080). When we started using 3090 GPUs for the next IWSLT participation, we could even increase `max_tokens` to 1,150,000.
Can you confirm that you are using the exact same config as us?
Thank you! Perhaps I need to examine my settings again; maybe there's something wrong with my training script. By the way, I guess your total number of parameters is about 1B? I use the wav2vec 2.0 base model, and my total number of parameters is 817M.
Hi! I checked the training log of the `lna_ed` architecture and this is what I found:
[2021-04-21 08:06:23,627][fairseq_cli.train][INFO] - num. shared model params: 781,977,216 (num. trained: 159,188,992)
I don't understand why you have more parameters than us, considering that you are using wav2vec base instead of the large architecture.
Apart from this, maybe you removed `batch_size` from the config? We had OOMs when setting only `max_tokens`, because some batches contained a lot of small samples. That's why we limit the batch with both `max_tokens: 480_000` and `batch_size: 18`.
You are using `fp16: True` and `memory_efficient_fp16: True`, right?
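To illustrate why `max_tokens` alone was not enough for us, here is a toy sketch of length-based batch grouping (simplified; not fairseq's actual batching code, and it assumes tokens are 16 kHz waveform samples):

```python
# Toy illustration of token-budget batching (not fairseq's real implementation).
def make_batches(sample_lengths, max_tokens, batch_size=None):
    batches, current, current_tokens = [], [], 0
    for length in sample_lengths:
        over_tokens = current_tokens + length > max_tokens
        over_samples = batch_size is not None and len(current) >= batch_size
        if current and (over_tokens or over_samples):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

# 1,000 short utterances of 8,000 input tokens each (~0.5 s of 16 kHz audio).
lengths = [8_000] * 1_000

print(max(len(b) for b in make_batches(lengths, max_tokens=480_000)))
# -> 60: with max_tokens only, a single batch can hold 60 utterances and OOM.
print(max(len(b) for b in make_batches(lengths, max_tokens=480_000, batch_size=18)))
# -> 18: batch_size keeps the number of samples per batch bounded.
```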
Thank you! I use `fp16`, so that won't be the problem. But I did not set `batch_size`, and I think that's the exact reason for my problem. I'd never noticed this.
As for the parameter mismatch, I'll check my model structure.
Thank you for your help!
Perfect, I'm happy to hear that! Tell us if you need anything else 👍
I'm trying to reproduce your result on MuST-C en-de, but I run into OOM errors. I use 4 RTX 3090 (24 GB) GPUs, the wav2vec 2.0 base model, the mBART one-to-many model, and the mBART 250,000-entry dictionary. The total number of parameters is 817M (344M trained), and I find that the maximum `max_tokens` I can set is 40,000, while yours is 440,000. I think cutting down the dictionary might help, but I notice that you just use the original dictionary. Do you have any thoughts on this? Or could you share the total number of parameters of your model? Thank you a lot!