taokz / BiomedGPT

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
https://www.nature.com/articles/s41591-024-03185-2
Apache License 2.0

I encountered a problem when reproducing the VQA task #32

Open huishao007 opened 2 weeks ago

huishao007 commented 2 weeks ago

After setting up the environment as required, I ran into a problem when reproducing the VQA task: the following error occurred when running evaluate_vqa_rad_beam_scale.sh, and I hope you can help. One more point: your article says you used CUDA 12.2, but biomedgpt.yml on GitHub specifies CUDA 11. Is the version on GitHub not your latest? This mismatch caused me some problems when reproducing the downstream subtasks you provided.

My bash command is as follows:

export MASTER_PORT=8082
user_dir=../../module
bpe_dir=../../utils/BPE
split=$1
data_dir=../../datasets/finetuning/vqa-rad
data=${data_dir}/test.tsv
ans2label_file=${data_dir}/trainval_ans2label.pkl

declare -a Scale=('tiny')

for scale in ${Scale[@]}; do
  if [[ $scale =~ "tiny" ]]; then
    patch_image_size=256
  elif [[ $scale =~ "medium" ]]; then
    patch_image_size=256
  elif [[ $scale =~ "base" ]]; then
    patch_image_size=384
  fi

path=../../checkpoint/biomedgpt_tiny.pt
result_path=./results/vqa_rad_beam/${scale}
mkdir -p $result_path
selected_cols=0,5,2,3,4

log_file=${result_path}/${scale}".log"

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
    ${data} \
    --path=${path} \
    --user-dir=${user_dir} \
    --task=vqa_gen \
    --batch-size=64 \
    --log-format=simple --log-interval=100 \
    --seed=7 \
    --gen-subset=${split} \
    --results-path=${result_path} \
    --fp16 \
    --ema-eval \
    --beam-search-vqa-eval \
    --beam=1 \
    --unnormalized \
    --temperature=1.0 \
    --num-workers=0 \
    --model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"selected_cols\":\"${selected_cols}\",\"ans2label_file\":\"${ans2label_file}\"}" > ${log_file} 2>&1
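For reference, the data files the script expects can be sanity-checked beforehand with a short Python snippet (illustrative only, not part of the official scripts; it assumes trainval_ans2label.pkl is a pickled dict mapping answer strings to label ids):

import csv
import pickle

data_dir = "../../datasets/finetuning/vqa-rad"

# Assumed layout: answer string -> label id; adjust if your copy differs.
with open(f"{data_dir}/trainval_ans2label.pkl", "rb") as f:
    ans2label = pickle.load(f)
print(len(ans2label), "candidate answers")

# selected_cols=0,5,2,3,4 must be valid column indices into test.tsv.
with open(f"{data_dir}/test.tsv") as f:
    first_row = next(csv.reader(f, delimiter="\t"))
print(len(first_row), "columns in the first row of test.tsv")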

The error is as follows:

2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | Start init
2024-08-29 15:23:52 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2024-08-29 15:23:52 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | initialized host h3c as rank 0
single-machine distributed training is initialized.
2024-08-29 15:23:54 | INFO | ofa.evaluate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 7, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '../../module', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': '../../checkpoint/biomedgpt_tiny.pt', 'post_process': None, 'quiet': False, 'model_overrides': '{"data":"../../datasets/finetuning/vqa-rad/test.tsv","bpe_dir":"../../utils/BPE","selected_cols":"0,5,2,3,4","ans2label_file":"../../datasets/finetuning/vqa-rad/trainval_ans2label.pkl"}', 'results_path': './results/vqa_rad_beam/tiny'}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 0, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 
64, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': '', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1, 'use_ema_weights_to_init_param': False, 'use_latest_weights_to_init_ema': False}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': True, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': None, 'task': {'_name': 'vqa_gen', 'data': '../../datasets/finetuning/vqa-rad/test.tsv', 'selected_cols': None, 'bpe': None, 'bpe_dir': None, 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 128, 'max_tgt_length': 30, 'code_dict_size': 8192, 'patch_image_size': 480, 'orig_patch_image_size': 256, 'num_bins': 1000, 'imagenet_default_mean_and_std': False, 'constraint_range': None, 'max_object_length': 30, 'ans2label_dict': '{"no": 0, "yes":1}', 'ans2label_file': None, 'unconstrained_training': False, 'add_object': False, 'valid_batch_size': 20, 'prompt_type': None, 'uses_ema': False, 'val_inference_type': 'allcand', 'eval_args': '{"beam":5,"unnormalized":true,"temperature":1.0}'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 
'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': None}
2024-08-29 15:23:54 | INFO | ofa.evaluate | loading model(s) from ../../checkpoint/biomedgpt_tiny.pt
2024-08-29 15:23:54 | INFO | tasks.ofa_task | source dictionary: 59457 types
2024-08-29 15:23:54 | INFO | tasks.ofa_task | target dictionary: 59457 types
Traceback (most recent call last):
  File "../../evaluate.py", line 174, in <module>
    cli_main()
  File "../../evaluate.py", line 169, in cli_main
    cfg, main, ema_eval=args.ema_eval, beam_search_vqa_eval=args.beam_search_vqa_eval, zero_shot=args.zero_shot
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, **kwargs)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 82, in main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/alldata/Htang_data/workspace/project/BiomedGPT/utils/checkpoint_utils.py", line 447, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/tasks/__init__.py", line 46, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/tasks/ofa_task.py", line 111, in setup_task
    return cls(cfg, src_dict, tgt_dict)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/tasks/pretrain_tasks/unify_task.py", line 97, in __init__
    self.type2ans_dict = json.load(open(os.path.join(self.cfg.neg_sample_dir, 'type2ans.json')))
FileNotFoundError: [Errno 2] No such file or directory: '/data/omnimed/pretrain_data/negative_sample/type2ans.json'
/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead.
See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2983314) of binary: /home/Htang/anaconda3/envs/biomedgpt/bin/python3
Traceback (most recent call last):
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../evaluate.py FAILED

Failures:
  <NO_OTHER_FAILURES>

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_15:23:58
  host      : h3c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2983314)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
taokz commented 2 weeks ago

The initial results were obtained using CUDA 11, as specified in the biomedgpt.yml file. Later, due to an automatic server upgrade, CUDA was updated to version 12. However, I continued running the code, and both CUDA 11 and CUDA 12 have been verified to work with the current implementation.
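If you want to double-check which versions your own environment resolves to, a quick snippet like this works (just a verification sketch, not part of the repo's scripts):

import torch

print("torch:", torch.__version__)
print("CUDA toolkit torch was built with:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

Note that torch.version.cuda reports the toolkit PyTorch was compiled against; the driver-level version shown by nvidia-smi can be newer, which is why both CUDA 11 and CUDA 12 hosts can run the same build.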

Regarding the error you encountered, it is due to the missing directory that self.cfg.neg_sample_dir points to and the type2ans.json file inside it. Please refer to the previous response in issue #10 for further details.
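If you only need evaluation to get past task setup, one stopgap is to create the path the error message names (a sketch only; the empty mapping below is a placeholder, and the real type2ans.json referenced in issue #10 should be used for anything beyond unblocking the load):

import json
import os

# Path taken verbatim from the FileNotFoundError above.
neg_sample_dir = "/data/omnimed/pretrain_data/negative_sample"
os.makedirs(neg_sample_dir, exist_ok=True)

# Placeholder only: an empty question-type -> answers mapping.
with open(os.path.join(neg_sample_dir, "type2ans.json"), "w") as f:
    json.dump({}, f)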

taokz commented 6 days ago

@huishao007 Has your issue been resolved? I think you might be using the pre-trained checkpoint instead of the fine-tuned one for VQA prediction. Using the pre-trained checkpoint may lead to the issue you're experiencing. You can try setting the --zero-shot flag in the script, as mentioned in issue #19 as well.

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
                ${data} \
                --path=${path} \
                --user-dir=${user_dir} \
                --task=vqa_gen \
                --selected-cols=${selected_cols} \
                --bpe-dir=${bpe_dir} \
                --patch-image-size=480 \
                --prompt-type='none' \
                --batch-size=8 \
                --log-format=simple --log-interval=10 \
                --seed=7 \
                --gen-subset=${split} \
                --results-path=${result_path} \
                --fp16 \
--zero-shot \
                --beam=${beam_size} \
                --unnormalized \
                --temperature=1.0 \
                --num-workers=0 \
                > ${log_file} 2>&1
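To tell whether a given checkpoint is the pre-trained or a fine-tuned one, you can peek at the config stored inside it (a rough sketch, assuming the usual fairseq checkpoint layout with 'cfg' and 'model' entries):

import torch

ckpt = torch.load("../../checkpoint/biomedgpt_tiny.pt", map_location="cpu")
print(ckpt.keys())  # fairseq checkpoints typically include 'cfg' and 'model'
cfg = ckpt.get("cfg")
if cfg is not None:
    # A pre-trained checkpoint records the pretraining task here,
    # while a VQA fine-tuned one records the vqa_gen task settings.
    print(cfg["task"])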