I'm trying to reproduce your distillation results from Section 2 of the FastFormers paper and I have a few questions I was hoping you could help with:
Did you use the weights provided in the TinyBERT repo (link) or those provided by Huawei in the HuggingFace model hub (link)?
Did you use `General_TinyBERT(Nlayer-Ddim)` or `General_TinyBERT_v2(Nlayer-Ddim)`?
I noticed that the Huawei models on the HuggingFace hub do not appear to be compatible with the Transformers library. For example, loading the tokenizer fails with the following error:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lewtun/git/transformers/src/transformers/models/auto/tokenization_auto.py", line 345, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/Users/lewtun/git/transformers/src/transformers/models/auto/configuration_auto.py", line 360, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in huawei-noah/TinyBERT_General_4L_312D. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
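My reading of the error message is that the auto classes first look for a `model_type` key in `config.json` and otherwise fall back to matching known substrings in the model name. The sketch below is a simplified paraphrase of that behaviour as the message describes it, not the library's actual code. `resolve_model_type` and the shortened `KNOWN_PATTERNS` list are hypothetical, and I am assuming the name match is case-sensitive, which would explain why `TinyBERT` does not match `bert`.

```python
# Simplified paraphrase of the dispatch described in the error message.
# NOT the Transformers implementation; all names here are hypothetical.
KNOWN_PATTERNS = ["mobilebert", "distilbert", "albert", "roberta", "bert"]


def resolve_model_type(config: dict, model_name: str) -> str:
    """Return a model type from the config, else from the model name."""
    # 1. An explicit `model_type` key in config.json takes priority.
    if "model_type" in config:
        return config["model_type"]
    # 2. Otherwise fall back to substring matching on the name
    #    (assumed to be case-sensitive).
    for pattern in KNOWN_PATTERNS:
        if pattern in model_name:
            return pattern
    raise ValueError(f"Unrecognized model in {model_name}.")


name = "huawei-noah/TinyBERT_General_4L_312D"
try:
    resolve_model_type({}, name)
except ValueError as err:
    # Neither the key nor a lowercase pattern is present.
    print(err)

# With the key present, the name no longer matters.
print(resolve_model_type({"model_type": "bert"}, name))
```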
Did you have to do something special to load TinyBERT in your FastFormers experiments? Looking at your source code (link) it seems you use the standard `from_pretrained` methods of the Transformers library, so I'm curious whether you encountered the same problem.
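In case it helps anyone hitting the same error, one workaround I am considering is to download the checkpoint files, add a `model_type` key to the local `config.json`, and point `from_pretrained` at that directory. The sketch below only does the JSON patching. `add_model_type` is a hypothetical helper, and setting the value to `"bert"` assumes the TinyBERT checkpoints follow the standard BERT architecture, which I have not verified against your setup.

```python
import json
from pathlib import Path


def add_model_type(config_path, model_type="bert"):
    """Add a `model_type` key to a local config.json if it is missing."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Leave an existing value untouched so the helper is safe to re-run.
    config.setdefault("model_type", model_type)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

After patching, passing the local directory to `AutoTokenizer.from_pretrained` should let the auto classes find the key, but I would still like to know whether you needed anything like this.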
Did you use the data augmentation technique from TinyBERT (i.e. combine BERT with GloVe word embeddings) in your experiments? Looking at your codebase, I could not see this being used, but I just want to double-check since it appears to play an important role in the TinyBERT paper.
Finally, what values of `state_loss_ratio` and `att_loss_ratio` did you use to generate the distilled model in Table 3 of your paper?
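To clarify what I am asking: I am assuming these two ratios act as scalar weights on the hidden-state loss and the attention loss before the two are summed, roughly as in the sketch below. The function name `weighted_distill_loss` and the plain-list MSE are placeholders of mine, not taken from your code or paper, so please correct me if the ratios are applied differently.

```python
def mse(student, teacher):
    """Mean squared error between two equal-length sequences of floats."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def weighted_distill_loss(student_hidden, teacher_hidden,
                          student_attn, teacher_attn,
                          state_loss_ratio, att_loss_ratio):
    """Weighted sum of hidden-state and attention MSE (assumed form)."""
    state_loss = mse(student_hidden, teacher_hidden)
    att_loss = mse(student_attn, teacher_attn)
    # Assumption: each ratio simply scales its own loss term.
    return state_loss_ratio * state_loss + att_loss_ratio * att_loss
```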
For reference, I am not working directly from the `fastformers` repo, so I have the following dependencies:
- `transformers` version: 4.0.0-rc-1
- Platform: Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): 2.3.0 (True)
- Using GPU in script?: (True)
- Using distributed or parallel set-up in script?: None
❓ Questions & Help
Details
Hello again @ykim362,
Thank you!