Open huishao007 opened 2 weeks ago
The initial results were obtained using CUDA 11, as specified in the biomedgpt.yml file. Later, due to an automatic server upgrade, CUDA was updated to version 12. However, I continued running the code, and both CUDA 11 and CUDA 12 have been verified to work with the current implementation.
The error you encountered occurs because the directory given by self.cfg.neg_sample_dir, and the type2ans.json file expected inside it, are missing. Please refer to the earlier response in Issue #10 for details.
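To confirm this before launching the evaluation, a quick standalone check can help. The snippet below is only a sketch: `check_type2ans` is a hypothetical helper, not part of the BiomedGPT code base, and it assumes type2ans.json is a plain JSON file inside whatever directory neg_sample_dir points to (as the `json.load(open(os.path.join(...)))` call in the traceback suggests).

```python
import json
import os


def check_type2ans(neg_sample_dir):
    """Hypothetical helper: load type2ans.json from neg_sample_dir.

    Raises FileNotFoundError with an explicit hint when the file is missing,
    instead of failing deep inside task setup.
    """
    path = os.path.join(neg_sample_dir, "type2ans.json")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected {path}; point neg_sample_dir at a directory "
            "that contains type2ans.json"
        )
    with open(path) as f:
        return json.load(f)
```

Calling it with the path from the error message (for example `check_type2ans("/data/omnimed/pretrain_data/negative_sample")`) should fail in the same way on a machine where that directory does not exist.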
@huishao007 Has your issue been resolved? I think you might be using the pre-trained checkpoint instead of the fine-tuned one for VQA prediction, which can lead to the error you're seeing. You can try setting the --zero-shot flag in the script, as also mentioned in issue #19.
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
${data} \
--path=${path} \
--user-dir=${user_dir} \
--task=vqa_gen \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--patch-image-size=480 \
--prompt-type='none' \
--batch-size=8 \
--log-format=simple --log-interval=10 \
--seed=7 \
--gen-subset=${split} \
--results-path=${result_path} \
--fp16 \
--zero-shot \
--beam=${beam_size} \
--unnormalized \
--temperature=1.0 \
--num-workers=0 \
> ${log_file} 2>&1
After setting up the environment as required, I ran into a problem reproducing the VQA task: the error below occurs when running evaluate_vqa_rad_beam_scale.sh. I would appreciate your help. Also, your article says you used CUDA 12.2, but biomedgpt.yml on GitHub specifies CUDA 11. Is the version on GitHub not your latest? Because of this, I ran into some problems when reproducing the downstream tasks you provided.
My bash command is as follows:
export MASTER_PORT=8082
user_dir=../../module
bpe_dir=../../utils/BPE
split=$1
data_dir=../../datasets/finetuning/vqa-rad
data=${data_dir}/test.tsv
ans2label_file=${data_dir}/trainval_ans2label.pkl
declare -a Scale=('tiny')
for scale in ${Scale[@]}; do
    if [[ $scale =~ "tiny" ]]; then
        patch_image_size=256
    elif [[ $scale =~ "medium" ]]; then
        patch_image_size=256
    elif [[ $scale =~ "base" ]]; then
        patch_image_size=384
    fi
The error is as follows:
2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | Start init
2024-08-29 15:23:52 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2024-08-29 15:23:52 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2024-08-29 15:23:52 | INFO | fairseq.distributed.utils | initialized host h3c as rank 0
single-machine distributed training is initialized.
2024-08-29 15:23:54 | INFO | ofa.evaluate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 7, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '../../module', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': '../../checkpoint/biomedgpt_tiny.pt', 'post_process': None, 'quiet': False, 'model_overrides': '{"data":"../../datasets/finetuning/vqa-rad/test.tsv","bpe_dir":"../../utils/BPE","selected_cols":"0,5,2,3,4","ans2label_file":"../../datasets/finetuning/vqa-rad/trainval_ans2label.pkl"}', 'results_path': './results/vqa_rad_beam/tiny'}, 'distributed_training': {'_name': None, 'distributed_world_size': 1,
'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 0, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 64, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': '', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 
'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1, 'use_ema_weights_to_init_param': False, 'use_latest_weights_to_init_ema': False}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': True, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 
9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': None, 'task': {'_name': 'vqa_gen', 'data': '../../datasets/finetuning/vqa-rad/test.tsv', 'selected_cols': None, 'bpe': None, 'bpe_dir': None, 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 128, 'max_tgt_length': 30, 'code_dict_size': 8192, 'patch_image_size': 480, 'orig_patch_image_size': 256, 'num_bins': 1000, 'imagenet_default_mean_and_std': False, 'constraint_range': None, 'max_object_length': 30, 'ans2label_dict': '{"no": 0, "yes":1}', 'ans2label_file': None, 'unconstrained_training': False, 'add_object': False, 'valid_batch_size': 20, 'prompt_type': None, 'uses_ema': False, 'val_inference_type': 'allcand', 'eval_args': '{"beam":5,"unnormalized":true,"temperature":1.0}'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': None}
2024-08-29 15:23:54 | INFO | ofa.evaluate | loading model(s) from ../../checkpoint/biomedgpt_tiny.pt
2024-08-29 15:23:54 | INFO | tasks.ofa_task | source dictionary: 59457 types
2024-08-29 15:23:54 | INFO | tasks.ofa_task | target dictionary: 59457 types
Traceback (most recent call last):
  File "../../evaluate.py", line 174, in <module>
    cli_main()
  File "../../evaluate.py", line 169, in cli_main
    cfg, main, ema_eval=args.ema_eval, beam_search_vqa_eval=args.beam_search_vqa_eval, zero_shot=args.zero_shot
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 82, in main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/alldata/Htang_data/workspace/project/BiomedGPT/utils/checkpoint_utils.py", line 447, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/fairseq/fairseq/tasks/__init__.py", line 46, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/tasks/ofa_task.py", line 111, in setup_task
    return cls(cfg, src_dict, tgt_dict)
  File "/alldata/Htang_data/workspace/project/BiomedGPT/tasks/pretrain_tasks/unify_task.py", line 97, in __init__
    self.type2ans_dict = json.load(open(os.path.join(self.cfg.neg_sample_dir, 'type2ans.json')))
FileNotFoundError: [Errno 2] No such file or directory: '/data/omnimed/pretrain_data/negative_sample/type2ans.json'
/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2983314) of binary: /home/Htang/anaconda3/envs/biomedgpt/bin/python3
Traceback (most recent call last):
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Htang/anaconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../../evaluate.py FAILED
Failures:
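
Separately from the missing-file error, the FutureWarning in this log recommends reading the local rank from os.environ['LOCAL_RANK'] rather than from a --local_rank argument. A minimal sketch of that pattern is below; `get_local_rank` is a hypothetical helper name, and falling back to 0 when the variable is unset is an assumption for single-process runs, not something the log specifies.

```python
import os


def get_local_rank(default=0):
    """Hypothetical helper: read the local rank set by the launcher.

    Falls back to `default` (an assumption) when LOCAL_RANK is not
    present in the environment, e.g. when the script is run directly.
    """
    return int(os.environ.get("LOCAL_RANK", default))
```

This only addresses the deprecation warning; it does not change the FileNotFoundError shown above.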