Hi!
We didn't modify the dictionary. Since we freeze the embeddings and the final linear projection, its size should have little impact on memory.
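To make this concrete, here is a toy PyTorch sketch of the idea (simplified, with illustrative module names, not the actual model code): once the embedding and output-projection weights are frozen, they still occupy memory for their parameters, but they need no gradients or optimizer state, so a 250k-entry vocabulary adds relatively little to training memory.

```python
import torch.nn as nn

# Toy stand-in for a decoder with a very large vocabulary
# (illustrative names; not the actual fairseq modules).
class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=250_000, dim=1024):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.output_projection = nn.Linear(dim, vocab_size, bias=False)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(12)])

model = ToyDecoder()

# Freeze the embeddings and the final projection: no gradients or Adam
# statistics are kept for them, only the weights themselves.
for name, p in model.named_parameters():
    if name.startswith(("embed_tokens", "output_projection")):
        p.requires_grad = False

trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"num. trained: {trained:,} / total: {total:,}")
```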
When we were working on this project, we used smaller GPUs than yours (2080). When we started using 3090 GPUs for the next IWSLT participation, we could even increase `max_tokens` to 1,150,000.
Can you confirm that you are using the exact same config as us?
Thank you! Perhaps I need to examine my settings again; maybe there's something wrong with my training script. By the way, I guess your total number of parameters is about 1B? I use the wav2vec 2.0 base model, and my total number of parameters is 817M.
Hi! I checked the training log of the `lna_ed` architecture and this is what I found:
[2021-04-21 08:06:23,627][fairseq_cli.train][INFO] - num. shared model params: 781,977,216 (num. trained: 159,188,992)
I don't understand why you have more parameters than us, considering that you are using wav2vec base instead of the large architecture.
Apart from this, maybe you removed `batch_size` from the config? We had OOMs when setting only `max_tokens`, because some batches contained a lot of small samples. That's why we limit the batch with both `max_tokens: 480_000` and `batch_size: 18`.
You are using `fp16: True` and `memory_efficient_fp16: True`, right?
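To illustrate why `max_tokens` alone was not enough for us, here is a toy sketch of length-based batch grouping (simplified; not fairseq's actual batching code, and it assumes tokens are 16 kHz waveform samples):

```python
# Toy illustration of token-budget batching (not fairseq's real implementation).
def make_batches(sample_lengths, max_tokens, batch_size=None):
    batches, current, current_tokens = [], [], 0
    for length in sample_lengths:
        over_tokens = current_tokens + length > max_tokens
        over_samples = batch_size is not None and len(current) >= batch_size
        if current and (over_tokens or over_samples):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

# 1,000 short utterances of 8,000 input tokens each (~0.5 s of 16 kHz audio).
lengths = [8_000] * 1_000

print(max(len(b) for b in make_batches(lengths, max_tokens=480_000)))
# -> 60: with max_tokens only, a single batch can hold 60 utterances and OOM.
print(max(len(b) for b in make_batches(lengths, max_tokens=480_000, batch_size=18)))
# -> 18: batch_size keeps the number of samples per batch bounded.
```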
Thank you! I use `fp16`, so that won't be the problem. But I did not set `batch_size`, and I think that's the exact reason for my problem. I'd never noticed this.
As for the parameter mismatch, I'll check my model structure.
Thank you for your help!
Perfect, I'm happy to hear that! Tell us if you need anything else 👍
I'm trying to reproduce your result on MuST-C en-de, but I run into OOM errors. I use 4 RTX 3090 (24 GB) GPUs, the wav2vec 2.0 base model, the mBART one-to-many model, and the mBART 250,000-entry dictionary. The total number of parameters is 817M (344M trained), and I find that the maximum `max_tokens` I can set is 40,000, while yours is 440,000. I think cutting down the dictionary might help, but I notice that you just use the original dictionary. Do you have any thoughts on this? Or could you share the total number of parameters of your model? Thank you a lot!