Performance (perplexity) decrease after converting Megatron GPT2 to a Hugging Face model #17483

Closed skdirwj closed 2 years ago

skdirwj commented 2 years ago

System Info

transformers==4.19.2
PyTorch: 1.11.0
CUDA: cu11.0
Train GPUs: 1node (A100 8gpus)
Test GPUs: A100 1gpu
Megatron-LM: https://github.com/NVIDIA/Megatron-LM

Who can help?

@younesbelkada

Information

Tasks

Reproduction

  1. Pretrain my own Megatron GPT2 on a corpus very similar to the one used for Nvidia's pre-trained Megatron GPT2 (https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m)
  2. Test perplexity on the WikiText-103 test set and compare the performance of the two pre-trained Megatron GPT2 models above using the evaluation script
  3. Convert both pre-trained models to Hugging Face models using the conversion script in transformers
    • https://github.com/huggingface/transformers/blob/main/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py
    • The config files of the converted models are shown below; only activation_function and vocab_size differ.
    • Mine
  {
    "activation_function": "gelu_fast",
    "architectures": [
      "GPT2LMHeadModel"
    ],
    "attn_pdrop": 0.1,
    "bos_token_id": 50256,
    "embd_pdrop": 0.1,
    "eos_token_id": 50256,
    "initializer_range": 0.02,
    "layer_norm_epsilon": 1e-05,
    "model_type": "gpt2",
    "n_embd": 1024,
    "n_head": 16,
    "n_inner": 4096,
    "n_layer": 24,
    "n_positions": 1024,
    "reorder_and_upcast_attn": false,
    "resid_pdrop": 0.1,
    "scale_attn_by_inverse_layer_idx": false,
    "scale_attn_weights": true,
    "summary_activation": null,
    "summary_first_dropout": 0.1,
    "summary_proj_to_labels": true,
    "summary_type": "cls_index",
    "summary_use_proj": true,
    "tokenizer_class": "GPT2TokenizerFast",
    "transformers_version": "4.19.2",
    "use_cache": true,
    "vocab_size": 50304
  }
    • Nvidia
   {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": 4096,
  "n_layer": 24,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "tokenizer_class": "GPT2TokenizerFast",
  "transformers_version": "4.19.2",
  "use_cache": true,
  "vocab_size": 50257
}
  4. Test perplexity on the WikiText-103 test set and compare the performance of the converted Hugging Face models by following the guide (a minimal evaluation sketch is included after the results below)
  5. The table below shows my test results
    • "Before" refers to the original pre-trained Megatron checkpoints and "After" refers to the converted Hugging Face models.
Model                   Before   After
NVIDIA Megatron_345M    14.77    17.15
My_Model_345M           15.73    23.89
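
For reference, this is roughly the kind of sliding-window evaluation loop I mean. It is only a minimal sketch, not my exact script; the local path `./converted_gpt2`, the WikiText-103 split from `datasets`, and the stride of 512 are placeholders.

```python
# Minimal sliding-window perplexity sketch for a converted GPT-2 checkpoint.
# "./converted_gpt2" is a hypothetical local path to the converted model/tokenizer.
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("./converted_gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("./converted_gpt2")

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length = model.config.n_positions  # 1024 for these 345M configs
stride = 512
nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens not yet scored in a previous window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context tokens
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"perplexity: {ppl.item():.2f}")
```

Running the same loop over both converted checkpoints keeps the tokenization and windowing identical, so any remaining gap comes from the model weights themselves.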

Expected behavior

I am wondering where the performance difference between the converted versions of my model and Nvidia's model comes from.

In addition, I do not know why my model's vocab size was changed from 50,257 to 50,304.
(50,304 is the vocab size of 50,257 plus dummy tokens)
I manually changed `activation_function` and `vocab_size` in my model's config file to match Nvidia's and tested again, but the performance difference stayed the same.
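
To be clear about what changing `vocab_size` would actually involve: editing config.json alone does not change the stored embedding matrix, so dropping the padded rows from the converted checkpoint itself would need something like the sketch below (the paths `./converted_gpt2` and `./converted_gpt2_trimmed` are placeholders).

```python
# Sketch: drop the Megatron vocabulary padding (50,304 -> 50,257 rows)
# from the converted checkpoint instead of only editing config.json.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("./converted_gpt2")  # hypothetical path
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")        # standard 50,257-token GPT-2 vocab

print(model.config.vocab_size)  # 50304 in the converted config
print(len(tokenizer))           # 50257

# resize_token_embeddings truncates wte (and the tied lm_head) to the new size
# and updates config.vocab_size accordingly.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("./converted_gpt2_trimmed")
tokenizer.save_pretrained("./converted_gpt2_trimmed")
```

After trimming, the output softmax is normalized over the same 50,257 tokens as Nvidia's converted model, which makes the two configs directly comparable.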

I expect similar perplexity from the converted Hugging Face versions of both my own pre-trained model and Nvidia's.
Has anyone had a similar experience?
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.