microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Error when loading dall_e models for Beit #668

Closed: Zhaoyi-Yan closed this 2 years ago

Zhaoyi-Yan commented 2 years ago

Describe the bug: I am using BEiT. When I tried to run the pre-training code on ImageNet-1k, the run crashed with the following traceback:

Traceback (most recent call last):
  File "run_beit_pretraining.py", line 280, in <module>
    main(opts)
  File "run_beit_pretraining.py", line 177, in main
    device=device, image_size=args.second_input_size)
  File "/userhome/yzy/Beit/unilm/beit/utils.py", line 527, in create_d_vae
    return get_dalle_vae(weight_path, image_size, device)
  File "/userhome/yzy/Beit/unilm/beit/utils.py", line 536, in get_dalle_vae
    vae.load_model(model_dir=weight_path, device=device)
  File "/userhome/yzy/Beit/unilm/beit/modeling_discrete_vae.py", line 214, in load_model
    self.encoder = load_model(os.path.join(model_dir, "encoder.pkl"), device)
  File "/userhome/yzy/Beit/unilm/beit/dall_e/__init__.py", line 18, in load_model
    return torch.load(f, map_location=device)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '-'.
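
The error message says the first byte that torch.load encounters is a literal '-', i.e. the file is not a pickle stream at all. What was actually saved can be inspected directly (assuming TOKENIZER_PATH='./dalle_model/' from the script below):

# Show the first bytes of the downloaded file; a valid torch pickle
# starts with binary pickle opcodes, not printable text.
head -c 64 ./dalle_model/encoder.pkl | od -c | head -n 4
file ./dalle_model/encoder.pkl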

The training script is:

#!/usr/bin/env bash

# Set the path to save checkpoints
OUTPUT_DIR='./output/tmp'
# Download and extract ImageNet-1k
DATA_PATH='/userhome/imagenet1k/'
# Download the tokenizer weight from OpenAI's DALL-E
TOKENIZER_PATH='./dalle_model/'
mkdir -p $TOKENIZER_PATH
wget -o $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -o $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=2 run_beit_pretraining.py \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
        --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
        --batch_size 4 --lr 1.5e-3 --warmup_epochs 10 --epochs 800 \
        --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1 \
        --imagenet_default_mean_and_std
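
One thing worth double-checking in the script above: lowercase wget -o FILE redirects wget's log messages to FILE, while capital -O FILE saves the downloaded content itself. Since a wget log begins with a line like "--2022-01-05 ...", a log file saved as encoder.pkl would also produce exactly this "invalid load key, '-'" error. A corrected download step would be:

# -O (capital) writes the response body to the given path;
# lowercase -o only redirects wget's log output.
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl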
addf400 commented 2 years ago

Hi @Zhaoyi-Yan, thanks for reporting this issue. It seems that the weight files at the URL in the README have been modified. I have uploaded a historical version here that has been tested to work properly; the README URL will be adjusted soon.
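
Whichever weights are used, they can be sanity-checked before launching a full training run by unpickling them the same way beit/dall_e/__init__.py does (a minimal check, assuming the files are stored under ./dalle_model/):

# Fails fast with UnpicklingError if either download is corrupt.
python -c "import torch; torch.load('./dalle_model/encoder.pkl', map_location='cpu'); torch.load('./dalle_model/decoder.pkl', map_location='cpu'); print('tokenizer weights load OK')"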

donglixp commented 2 years ago

Solved by the PR at https://github.com/microsoft/unilm/pull/670#event-6324997330