microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

[BEIT-3] error happens when I evaluate BEiT-3 finetuned model on VQAv2 #1597

Closed matsutaku44 closed 4 months ago

matsutaku44 commented 4 months ago

Describe Model I am using (UniLM, MiniLM, LayoutLM ...): BEIT-3

I want to evaluate BEiT-3 finetuned model on VQAv2. https://github.com/microsoft/unilm/blob/master/beit3/get_started/get_started_for_vqav2.md#example-evaluate-beit-3-finetuned-model-on-vqav2-visual-question-answering

However, an error happens, and I cannot understand what the error message means. How can I solve this problem? Please help me. Thank you for sharing the BEiT-3 code.

(beit3) matsuzaki.takumi@docker:~/workspace/vqa/unilm/beit3$ python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py \
>         --model beit3_base_patch16_480 \
>         --input_size 480 \
>         --task vqav2 \
>         --batch_size 4 \
>         --sentencepiece_model /mnt/new_mensa/data/VQAv2/BEIT3/beit3.spm \
>         --finetune /mnt/new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
>         --data_path /mnt/new_mensa/data/VQAv2 \
>         --output_dir ./prediction_saveHere \
>         --eval \
>         --dist_eval
/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757]
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] *****************************************
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] *****************************************
usage: BEiT fine-tuning and evaluation script for image classification
       [--model MODEL] --task
       {nlvr2,vqav2,flickr30k,coco_retrieval,coco_captioning,nocaps,imagenet}
       [--input_size INPUT_SIZE] [--drop_path PCT]
       [--checkpoint_activations] --sentencepiece_model
       SENTENCEPIECE_MODEL [--vocab_size VOCAB_SIZE]
       [--num_max_bpe_tokens NUM_MAX_BPE_TOKENS] [--model_ema]
       [--model_ema_decay MODEL_EMA_DECAY] [--model_ema_force_cpu]
       [--opt OPTIMIZER] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]]
       [--clip_grad NORM] [--momentum M] [--weight_decay WEIGHT_DECAY]
       [--lr LR] [--layer_decay LAYER_DECAY]
       [--task_head_lr_weight TASK_HEAD_LR_WEIGHT] [--warmup_lr LR]
       [--min_lr LR] [--warmup_epochs N] [--warmup_steps N]
       [--batch_size BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
       [--epochs EPOCHS] [--update_freq UPDATE_FREQ]
       [--save_ckpt_freq SAVE_CKPT_FREQ] [--randaug]
       [--train_interpolation TRAIN_INTERPOLATION] [--finetune FINETUNE]
       [--model_key MODEL_KEY] [--model_prefix MODEL_PREFIX]
       [--data_path DATA_PATH] [--output_dir OUTPUT_DIR]
       [--log_dir LOG_DIR] [--device DEVICE] [--seed SEED]
       [--resume RESUME] [--auto_resume] [--no_auto_resume] [--save_ckpt]
       [--no_save_ckpt] [--start_epoch N] [--eval] [--dist_eval]
       [--num_workers NUM_WORKERS] [--pin_mem] [--no_pin_mem]
       [--world_size WORLD_SIZE] [--local_rank LOCAL_RANK] [--dist_on_itp]
       [--dist_url DIST_URL] [--task_cache_path TASK_CACHE_PATH]
       [--nb_classes NB_CLASSES] [--mixup MIXUP] [--cutmix CUTMIX]
       [--cutmix_minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]]
       [--mixup_prob MIXUP_PROB] [--mixup_switch_prob MIXUP_SWITCH_PROB]
       [--mixup_mode MIXUP_MODE] [--color_jitter PCT] [--aa NAME]
       [--smoothing SMOOTHING] [--crop_pct CROP_PCT] [--reprob PCT]
       [--remode REMODE] [--recount RECOUNT] [--resplit]
       [--captioning_mask_prob CAPTIONING_MASK_PROB]
       [--drop_worst_ratio DROP_WORST_RATIO]
       [--drop_worst_after DROP_WORST_AFTER] [--num_beams NUM_BEAMS]
       [--length_penalty LENGTH_PENALTY]
       [--label_smoothing LABEL_SMOOTHING] [--enable_deepspeed]
       [--initial_scale_power INITIAL_SCALE_POWER]
       [--zero_stage ZERO_STAGE]
BEiT fine-tuning and evaluation script for image classification: error: unrecognized arguments: --local-rank=0
BEiT fine-tuning and evaluation script for image classification: error: unrecognized arguments: --local-rank=1
E0706 05:15:44.682819 140419543589120 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 76) of binary: /home/matsuzaki.takumi/.conda/envs/beit3/bin/python
Traceback (most recent call last):
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_beit3_finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-06_05:15:44
  host      : docker
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 77)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_05:15:44
  host      : docker
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 76)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Sv3n01 commented 4 months ago

Changing "python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py" to "python -m run_beit3_finetuning" solved it for me in google colab.

matsutaku44 commented 4 months ago

@Sv3n01 Thank you for replying! I am trying this change now.

matsutaku44 commented 4 months ago

I removed "torch.distributed.launch --nproc_per_node=2" and run again. Then, Evaluation seemed to be started. Thank you very much! However, a different error happens.

I ran this command:

python run_beit3_finetuning.py \
        --model beit3_base_patch16_480 \
        --input_size 480 \
        --task vqav2 \
        --batch_size 16 \
        --sentencepiece_model ../../../../new_mensa/data/VQAv2/BEIT3/beit3.spm \
        --finetune ../../../../new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
        --data_path ../../../../new_mensa/data/VQAv2 \
        --output_dir ./prediction_saveHere \
        --eval \
        --dist_eval

The error:

. . .

Test:  [18640/18659]  eta: 0:00:05    time: 0.2775  data: 0.0002  max mem: 3774
Test:  [18650/18659]  eta: 0:00:02    time: 0.2774  data: 0.0002  max mem: 3774
Test:  [18658/18659]  eta: 0:00:00    time: 0.2658  data: 0.0000  max mem: 3774
Test: Total time: 1:26:48 (0.2792 s / it)
Traceback (most recent call last):
  File "run_beit3_finetuning.py", line 448, in <module>
    main(opts, ds_init)
  File "run_beit3_finetuning.py", line 365, in main
    utils.dump_predictions(args, result, "vqav2_test")
  File "/home/matsuzaki.takumi/workspace/vqa/unilm/beit3/utils.py", line 845, in dump_predictions
    torch.distributed.barrier()
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3672, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 649, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

I am trying to solve this problem now. Do you have a solution? Please let me know.
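
(My current understanding: the script was started as a single process, so no default process group exists, yet utils.dump_predictions still calls torch.distributed.barrier(). A possible guard, just a sketch of the idea and not the repository's code:)

    import torch.distributed as dist

    # Sketch: only synchronize when a process group has actually been initialized,
    # so single-process runs skip the collective call.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()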

matsutaku44 commented 4 months ago

I was able to get submit_vqav2_test.json (the list of question_id/answer pairs).

I added this to run_beit3_finetuning.py (line 141): parser.add_argument("--local-rank", type=int)
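
For reference, the change looks roughly like this in context (my local edit, not the upstream code; argparse maps the hyphenated flag to the same args.local_rank attribute the script already uses):

    # Local edit around line 141 of run_beit3_finetuning.py: accept the hyphenated
    # flag that newer torch.distributed.launch passes to each worker process.
    # argparse stores the value as args.local_rank, alongside the existing
    # --local_rank option.
    parser.add_argument("--local-rank", type=int)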

Then I ran this command (so maybe you should not omit "-m torch.distributed.launch --nproc_per_node=2" after all):

python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py \
        --model beit3_base_patch16_480 \
        --input_size 480 \
        --task vqav2 \
        --batch_size 16 \
        --sentencepiece_model ../../../../new_mensa/data/VQAv2/BEIT3/beit3.spm \
        --finetune ../../../../new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
        --data_path ../../../../new_mensa/data/VQAv2 \
        --output_dir ./prediction_saveHere \
        --eval \
        --dist_eval

Then I got submit_vqav2_test.json:

. . . 

Test:  [9310/9330]  eta: 0:00:05    time: 0.2790  data: 0.0002  max mem: 4665
Test:  [9320/9330]  eta: 0:00:02    time: 0.2789  data: 0.0002  max mem: 4665
Test:  [9329/9330]  eta: 0:00:00    time: 0.2674  data: 0.0001  max mem: 4665
Test: Total time: 0:43:23 (0.2790 s / it)
Infer 447793 examples into ./prediction_saveHere/submit_vqav2_test.json

I don't know exactly why this produces the JSON file (presumably the launcher initializes the distributed process group, so the torch.distributed.barrier() call in dump_predictions no longer fails), but I am closing this issue.