tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for Pytorch.
https://arxiv.org/pdf/2006.04768.pdf
MIT License
403 stars 36 forks

Enquiry about your implementation #10

Closed riven314 closed 4 years ago

riven314 commented 4 years ago

Thanks for your great work!

I have a few enquiries about your implementation:

  1. Could you reproduce the paper's results (or something close to them) with your implementation?
  2. An ordinary Transformer requires multiple GPUs to train from scratch. Is it possible to train your Linformer implementation from scratch on a single GPU (8 GB / 11 GB)?

Thanks, Alex Lau

tatp22 commented 4 years ago

Hi @riven314!

  1. I could in theory reproduce the results, but my hardware isn't nearly as powerful as what FB has. As stated in section 5.1 of the paper:
    We first compare the pretraining performance of our proposed architecture against RoBERTa (Liu et al., 2019), which is based on the Transformer. Following Devlin et al. (2019), we use Book Corpus (Zhu et al., 2015) plus English Wikipedia as our pretraining set (3300M words). All models are pretrained with the masked-language-modeling (MLM) objective, and the training for all experiments are parallelized across 64 Tesla V100 GPUs with 250k updates.

    I unfortunately don't have 64 V100 GPUs at my disposal, so reproducing the results would take me far longer than it did the authors of the paper. That said, if someone reading this does have the resources, they are welcome to run the experiments themselves.

  2. Yes, it is entirely possible to train it on one GPU, although it will take longer. It depends on the task you are training for, of course, but as long as a single training example fits in GPU memory, you can work with a small batch size (accumulating gradients if needed) and reach the same results as if you had multiple GPUs; see the sketch below. In fact, the main reason for having multiple GPUs in the first place is simply to speed up training, whether through model or data parallelism.
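
For reference, here is a minimal single-GPU sketch with gradient accumulation. The `Linformer` constructor arguments follow this repo's README, but the random data, loss function, and hyperparameters are placeholders assumed purely for illustration, not a real training recipe:

```python
# Minimal single-GPU training sketch with gradient accumulation.
# Assumptions: constructor args as in this repo's README; dummy data and
# an MSE "objective" stand in for a real dataset and task.
import torch
import torch.nn.functional as F
from linformer_pytorch import Linformer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Linformer(
    input_size=4096,  # sequence length (dimension 1 of the input)
    channels=128,     # embedding size (dimension 2 of the input)
    dim_k=256,        # projected dimension k from the paper
    dim_ff=128,       # feed-forward dimension
    nhead=4,          # number of attention heads
    depth=2,          # number of layers
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
accum_steps = 8       # micro-batches per optimizer step -> effective batch of 8

model.train()
for step in range(80):
    x = torch.randn(1, 4096, 128, device=device)  # dummy micro-batch of size 1
    out = model(x)
    loss = F.mse_loss(out, x)                      # placeholder objective
    (loss / accum_steps).backward()                # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one "large-batch" update on a single GPU
        optimizer.zero_grad()
```

With gradient accumulation, the gradient applied at each optimizer step is (up to batch-norm-style effects, which Linformer doesn't use) the same as what you would get from a larger batch spread across several GPUs, just computed more slowly.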

Hope this helps!

riven314 commented 4 years ago

thanks!