microsoft / fastformers

FastFormers - highly efficient transformer models for NLU

Which TinyBERT models were used for student initialisation? #11

Closed: lewtun closed this issue 3 years ago

lewtun commented 3 years ago

❓ Questions & Help


Hello again @ykim362,

I'm trying to reproduce your distillation results from Section 2 of the FastFormers paper and I have a few questions I was hoping you could help with:

  1. Did you use the weights provided in the TinyBERT repo (link) or those provided by Huawei in the HuggingFace model hub (link)?
  2. Did you use General_TinyBERT(Nlayer-Ddim) or General_TinyBERT_v2(Nlayer-Ddim)?
  3. I noticed that the Huawei models on the HuggingFace hub do not appear to be compatible with the Transformers library. For example, loading the tokenizer fails with the following error:
    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/lewtun/git/transformers/src/transformers/models/auto/", line 345, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    File "/Users/lewtun/git/transformers/src/transformers/models/auto/", line 360, in from_pretrained
    raise ValueError(
    ValueError: Unrecognized model in huawei-noah/TinyBERT_General_4L_312D. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas

    Did you have to do something special to load TinyBERT in your FastFormers experiments? Looking at your source code (link) it seems you use the standard from_pretrained methods of the Transformers library, so I'm curious whether you encountered the same problem.

  4. Did you use the data augmentation technique from TinyBERT (i.e. combine BERT with GloVe word embeddings) in your experiments? Looking at your codebase, I could not see this being used, but just want to double-check since it appears to play an important role in the TinyBERT paper.
  5. Finally, what values of state_loss_ratio and att_loss_ratio did you use to generate the distilled model in Table 3 of your paper?

For reference, I am not working directly from the fastformers repo, so I have the following dependencies:

- `transformers` version: 4.0.0-rc-1
- Platform: Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): 2.3.0 (True)
- Using GPU in script?: (True)
- Using distributed or parallel set-up in script?: None

Thank you!

ykim362 commented 3 years ago
  1. For the paper, we used the model from the TinyBERT repo.
  2. We used v2.
  3. I haven't tried the one from the HuggingFace hub. For the TinyBERT model, we set model_type to bert.
  4. We did not use data augmentation.
  5. For Table 3 we used logits distillation only, which means both state_loss_ratio and att_loss_ratio were set to 0.0.
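
To illustrate answer 5, here is a minimal sketch of how zero ratios reduce a combined distillation objective to the logits term alone. It assumes the ratios act as multiplicative weights on the hidden-state and attention losses, and that the logits term is a soft cross-entropy between teacher and student distributions; neither assumption is confirmed in this thread. The function names and the `temperature` parameter are hypothetical and are not taken from the FastFormers code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def logits_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy of the student against the teacher distribution.

    Assumption: this form of the logits loss is illustrative only.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def total_distillation_loss(logits_loss, state_loss, att_loss,
                            state_loss_ratio=0.0, att_loss_ratio=0.0):
    """Weighted sum of the three loss terms (assumed combination rule)."""
    return logits_loss + state_loss_ratio * state_loss + att_loss_ratio * att_loss


# With both ratios at 0.0, only the logits term contributes.
logits_loss = logits_distillation_loss([0.2, 1.5, -0.3], [0.1, 2.0, -1.0])
total = total_distillation_loss(logits_loss, state_loss=5.0, att_loss=7.0)
```

Under these assumptions, `total` equals `logits_loss` whatever the values of `state_loss` and `att_loss`, which matches the description of the Table 3 setting as logits distillation only.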

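On answer 3, the error quoted in the question says the config.json should have a `model_type` key. The sketch below adds that key to a locally saved config file when it is missing. It assumes the checkpoint has already been downloaded to a local directory; the helper name `ensure_model_type` and the example path are hypothetical, and this is not necessarily how the FastFormers authors set the value.

```python
import json
from pathlib import Path


def ensure_model_type(config_path, model_type="bert"):
    """Add a `model_type` key to a config.json file if it is missing.

    Returns True if the file was modified, False if the key was already set.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if "model_type" in config:
        # Leave an existing value untouched.
        return False
    config["model_type"] = model_type
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Hypothetical usage, assuming a local checkpoint directory:
# ensure_model_type("path/to/tinybert_checkpoint/config.json")
```

Once the key is present, passing the local directory to the `from_pretrained` methods should let the Auto classes resolve the architecture, though this has not been verified against the specific TinyBERT checkpoints discussed here.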

lewtun commented 3 years ago

Thanks a lot for the answers - they're really helpful!

Closing this issue since all my questions are answered :)