huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wrong perplexity when evaluating megatron-gpt2 #11916

Closed codecaution closed 3 years ago

codecaution commented 3 years ago

Environment info

Who can help

@jdemouth @LysandreJik @sgugger

Information

Model I am using: gpt2 (megatron-gpt2-345m)

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. Follow the steps given by Hugging Face to convert the Megatron-LM model to a Hugging Face model.

    • export MYDIR=/mnt/reproduce
    • git clone https://github.com/huggingface/transformers.git $MYDIR/transformers
    • mkdir -p $MYDIR/nvidia/megatron-gpt2-345m
    • wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip
    • python3 $MYDIR/transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip (here I hit the error "io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead." I worked around it as follows; an alternative based on the error message's own suggestion is sketched after this list:

      • unzip $MYDIR/nvidia/megatron-gpt2-345m/checkpoint.zip,
      • change the code in transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py at Line 209-211 to
        
        # Load the extracted .pt file directly instead of reading from inside the zip archive
        with open(args.path_to_checkpoint, "rb") as pytorch_dict:
            input_state_dict = torch.load(pytorch_dict, map_location="cpu")
      
      • python3 $MYDIR/transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/model_optim_rng.pt
    • git clone https://huggingface.co/nvidia/megatron-gpt2-345m/
    • mv $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/pytorch_model.bin $MYDIR/nvidia/megatron-gpt2-345m/release/mp_rank_00/config.json $MYDIR/megatron-gpt2-345m/
  2. run run_clm.py on wikitext-2; the script is the one given in the README.
    CUDA_VISIBLE_DEVICES=0 python $MYDIR/transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path $MYDIR/megatron-gpt2-345m \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_eval \
    --output_dir /mnt/logs/evaluation/megatron/wikitext-2
  3. The results are shown below; the perplexity is clearly wrong (I also tested on other datasets, and the perplexity there is similarly large; see the note after this list):
    [INFO|trainer_pt_utils.py:907] 2021-05-28 04:17:49,817 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_loss               =       11.63
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_runtime            =  0:00:22.85
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_samples            =         240
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_samples_per_second =      10.501
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   eval_steps_per_second   =       1.313
    [INFO|trainer_pt_utils.py:912] 2021-05-28 04:17:49,817 >>   perplexity              = 112422.0502
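
Side note (not part of the original report): run_clm.py reports perplexity as the exponential of the evaluation loss, so the huge value above follows directly from eval_loss = 11.63. A minimal sketch of that relationship, using the figures from the log:

    import math

    # run_clm.py computes perplexity = exp(eval_loss)
    eval_loss = 11.63                  # value taken from the eval log above
    print(math.exp(eval_loss))         # ~1.12e5, matching the reported 112422.0502

    # A uniform-random model over GPT-2's 50257-token vocabulary would score a loss of
    # ln(50257) ~= 10.8, so a loss of 11.63 means the converted weights behave no better
    # than random, which suggests the conversion itself is the culprit rather than run_clm.py.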

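As an aside, the error message in step 1 also points to another way to patch the script: pre-load the zipped checkpoint into a seekable in-memory buffer before calling torch.load, which avoids unzipping the archive first. A minimal standalone sketch of that idea (untested; the member path inside the NGC zip is an assumption based on the layout shown above):

    import io
    import zipfile

    import torch

    # Read the Megatron-LM checkpoint out of the NGC zip into a seekable buffer,
    # then load it the way the error message recommends.
    checkpoint_zip = "/mnt/reproduce/nvidia/megatron-gpt2-345m/checkpoint.zip"
    member = "release/mp_rank_00/model_optim_rng.pt"   # assumed path inside the archive

    with zipfile.ZipFile(checkpoint_zip) as archive:
        buffer = io.BytesIO(archive.read(member))

    input_state_dict = torch.load(buffer, map_location="cpu")
    print(sorted(input_state_dict.keys())[:5])  # quick check that the state dict loaded
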
Expected behavior

I want to convert my Megatron-LM model checkpoint into a Hugging Face model and get a reasonable evaluation perplexity. Please help me.
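
One quick way to sanity-check a converted checkpoint (a minimal sketch, not from the original thread; it assumes the converted directory also contains the tokenizer files from the cloned nvidia/megatron-gpt2-345m repo):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model_dir = "/mnt/reproduce/megatron-gpt2-345m"    # converted checkpoint directory from step 1
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)

    # A correctly converted model should continue the prompt coherently;
    # garbled output would point at the conversion rather than the eval script.
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=20, do_sample=False)
    print(tokenizer.decode(outputs[0]))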

jdemouth commented 3 years ago

We’ll try to reproduce the issue on our side. We’ll keep you posted. Thanks!

codecaution commented 3 years ago

> We’ll try to reproduce the issue on our side. We’ll keep you posted. Thanks!

Thanks for your help!

jdemouth commented 3 years ago

We (NVIDIA engineers) were able to reproduce strange perplexity results and we are trying to identify the root cause. We will update you as we know more. Thanks for reporting the issue and for the reproducer.

hwijeen commented 3 years ago

Hi, I think #12004 is a related issue.