huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wrong perplexity when evaluating megatron-gpt2 #11916

Closed codecaution closed 3 years ago

codecaution commented 3 years ago

Environment info

Who can help

@jdemouth @LysandreJik @sgugger

Information

Model I am using: gpt2 (megatron-gpt2-345m)

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. Follow the steps given by Hugging Face to convert the Megatron-LM model to a Hugging Face model.

    • export MYDIR=/mnt/reproduce
    • git clone https://github.com/huggingface/transformers.git $MYDIR/transformers
    • mkdir -p $MYDIR/nvidia/megatron-gpt2-345m
    • wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip
    • python3 $MYDIR/transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip (here I hit the error "io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead." I worked around it as follows; an alternative based on the error message's own suggestion is sketched after this list:

      • unzip $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip,
      • change the code in transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py at Line 209-211 to
        
        # Load the extracted .pt file directly instead of reading from inside the zip archive
        with open(args.path_to_checkpoint, "rb") as pytorch_dict:
            input_state_dict = torch.load(pytorch_dict, map_location="cpu")
      
      • python3 $MYDIR/transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/model_optim_rng.pt
    • git clone https://huggingface.co/nvidia/megatron-gpt2-345m/
    • mv $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/pytorch_model.bin $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/config.json $MYDIR/megatron-gpt2-345m/
  2. run run_clm.py on wikitext-2; the script is the one given in the README.
    CUDA_VISIBLE_DEVICES=0 python $MYDIR/transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path $MYDIR/megatron-gpt2-345m \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_eval \
    --output_dir /mnt/logs/evaluation/megatron/wikitext-2
  3. The results are shown below; the perplexity is clearly wrong (I also tested on other datasets, and the perplexity there is similarly large; see the note after this list):
    [INFO|trainer_pt_utils.py:907] 2021-05-28 04:17:49,817 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_loss               =       11.63
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_runtime            =  0:00:22.85
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_samples            =         240
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_samples_per_second =      10.501
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_steps_per_second   =       1.313
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   perplexity              = 112422.0502
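
Side note (not part of the original report): run_clm.py reports perplexity as the exponential of the evaluation loss, so the huge value above follows directly from eval_loss = 11.63. A minimal sketch of that relationship, using the figures from the log:

    import math

    # run_clm.py computes perplexity = exp(eval_loss)
    eval_loss = 11.63                  # value taken from the eval log above
    print(math.exp(eval_loss))         # ~1.12e5, matching the reported 112422.0502

    # A uniform-random model over GPT-2's 50257-token vocabulary would score a loss of
    # ln(50257) ~= 10.8, so a loss of 11.63 means the converted weights behave no better
    # than random, which suggests the conversion itself is the culprit rather than run_clm.py.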

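As an aside, the error message in step 1 also points to another way to patch the script: pre-load the zipped checkpoint into a seekable in-memory buffer before calling torch.load, which avoids unzipping the archive first. A minimal standalone sketch of that idea (untested; the member path inside the NGC zip is an assumption based on the layout shown above):

    import io
    import zipfile

    import torch

    # Read the Megatron-LM checkpoint out of the NGC zip into a seekable buffer,
    # then load it the way the error message recommends.
    checkpoint_zip = "/mnt/reproduce/nvidia/megatron-gpt2-345m/checkpoint.zip"
    member = "release/mp_rank_00/model_optim_rng.pt"   # assumed path inside the archive

    with zipfile.ZipFile(checkpoint_zip) as archive:
        buffer = io.BytesIO(archive.read(member))

    input_state_dict = torch.load(buffer, map_location="cpu")
    print(sorted(input_state_dict.keys())[:5])  # quick check that the state dict loaded
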
Expected behavior

I want to convert my Megatron-LM model checkpoint into a Hugging Face model and get a reasonable evaluation perplexity. Please help me.
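
One quick way to sanity-check a converted checkpoint (a minimal sketch, not from the original thread; it assumes the converted directory also contains the tokenizer files from the cloned nvidia/megatron-gpt2-345m repo):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model_dir = "/mnt/reproduce/megatron-gpt2-345m"    # converted checkpoint directory from step 1
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)

    # A correctly converted model should continue the prompt coherently;
    # garbled output would point at the conversion rather than the eval script.
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=20, do_sample=False)
    print(tokenizer.decode(outputs[0]))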

jdemouth commented 3 years ago

We’ll try to reproduce the issue on our side. We’ll keep you posted. Thanks!

codecaution commented 3 years ago

> We’ll try to reproduce the issue on our side. We’ll keep you posted. Thanks!

Thanks for your help!

jdemouth commented 3 years ago

We (NVIDIA engineers) were able to reproduce strange perplexity results and we are trying to identify the root cause. We will update you as we know more. Thanks for reporting the issue and for the reproducer.

hwijeen commented 3 years ago

Hi, I think #12004 is a related issue.