taokz / BiomedGPT

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
https://www.nature.com/articles/s41591-024-03185-2
Apache License 2.0

BPE Directory issue #53

Open · anasster opened this issue 1 day ago

anasster commented 1 day ago

Hello, I am trying to run a zero-shot VQA script but it cannot find the BPE directory, even when I explicitly set it to GPT2. Here is my inference script:

```bash
export MASTER_PORT=1091

user_dir=../../module
bpe_dir=../../utils/BPE

split=test
bpe=gpt2

data_dir=../../datasets/finetuning/gastrolab
data=${data_dir}/gastrolab.tsv

declare -a Scale=('tiny' 'medium' 'base')

for scale in ${Scale[@]}; do
    # pick the patch size that matches the model scale
    if [[ $scale =~ "tiny" ]]; then
        patch_image_size=256
    elif [[ $scale =~ "medium" ]]; then
        patch_image_size=256
    elif [[ $scale =~ "base" ]]; then
        patch_image_size=384
    fi

    path=../../checkpoints/${scale}/checkpoint.pt
    result_path=./results/gastrolab/${scale}
    mkdir -p ${result_path}

    selected_cols=0,1,2,3,4
    log_file=${result_path}/${scale}".log"

    CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
        ${data} \
        --path=${path} \
        --user-dir=${user_dir} \
        --task=vqa_gen \
        --batch-size=16 \
        --log-format=simple --log-interval=10 \
        --seed=42 \
        --gen-subset=${split} \
        --results-path=${result_path} \
        --fp16 \
        --ema-eval \
        --beam-search-vqa-eval \
        --zero-shot \
        --match-source-len \
        --beam=3 \
        --temperature=1.0 \
        --num-shards=1 \
        --distributed-world-size=1 \
        --model-overrides="{'data':'${data}','bpe_dir':'${bpe_dir}','bpe':'${bpe}','selected_cols':'${selected_cols}'}" > ${log_file} 2>&1

    python ../caption/metric_caption.py ${data} ${result_path}/test_predict.json
done
```

and here is my error log:

Errors:

```
2024-10-24 17:37:23 | INFO | fairseq.distributed.utils | distributed init (rank 0): env:// 2024-10-24 17:37:23 | INFO | fairseq.distributed.utils | Start init 2024-10-24 17:37:23 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0 2024-10-24 17:37:23 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes. 2024-10-24 17:37:23 | INFO | fairseq.distributed.utils | initialized host DIMITRA as rank 0 single-machine distributed training is initialized. 2024-10-24 17:37:24 | INFO | ofa.evaluate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 42, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '../../module', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': '../../checkpoints/base/checkpoint.pt', 'post_process': None, 'quiet': False, 'model_overrides': '{"data":"../../datasets/finetuning/gastrolab/gastrolab.tsv","bpe_dir":"../../utils/BPE","selected_cols":"0,1,2,3,4"}', 'results_path': './results/gastrolab/base'}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 16, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 16, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 
'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1, 'use_ema_weights_to_init_param': False, 'use_latest_weights_to_init_ema': False}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 3, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': True, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': None, 'task': {'_name': 'vqa_gen', 'data': '../../datasets/finetuning/gastrolab/gastrolab.tsv', 'selected_cols': None, 'bpe': None, 'bpe_dir': None, 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 128, 'max_tgt_length': 30, 'code_dict_size': 8192, 'patch_image_size': 480, 'orig_patch_image_size': 256, 'num_bins': 1000, 'imagenet_default_mean_and_std': False, 'constraint_range': None, 'max_object_length': 30, 'ans2label_dict': '{"no": 0, "yes":1}', 'ans2label_file': None, 'unconstrained_training': False, 'add_object': False, 'valid_batch_size': 20, 'prompt_type': None, 'uses_ema': False, 'val_inference_type': 'allcand', 'eval_args': '{"beam":5,"unnormalized":true,"temperature":1.0}'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 
'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': None}
2024-10-24 17:37:24 | INFO | ofa.evaluate | loading model(s) from ../../checkpoints/base/checkpoint.pt
Traceback (most recent call last):
  File "../../evaluate.py", line 160, in <module>
    cli_main()
  File "../../evaluate.py", line 155, in cli_main
    cfg, main, ema_eval=args.ema_eval, beam_search_vqa_eval=args.beam_search_vqa_eval, zero_shot=args.zero_shot
  File "/home/dimitra/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/dimitra/BiomedGPT/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 67, in main
    task = tasks.setup_task(cfg.task)
  File "/home/dimitra/BiomedGPT/fairseq/fairseq/tasks/__init__.py", line 46, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/home/dimitra/BiomedGPT/tasks/ofa_task.py", line 94, in setup_task
    os.path.join(cfg.bpe_dir, "dict.txt")
  File "/home/dimitra/.pyenv/versions/3.7.4/lib/python3.7/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4993) of binary: /home/dimitra/.pyenv/versions/biomedgpt/bin/python
Traceback (most recent call last):
  File "/home/dimitra/.pyenv/versions/biomedgpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/dimitra/.pyenv/versions/biomedgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/dimitra/.pyenv/versions/biomedgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/dimitra/.pyenv/versions/biomedgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(cmd_args)
  File "/home/dimitra/.pyenv/versions/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dimitra/.pyenv/versions/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../evaluate.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-24_17:37:28
  host      : DIMITRA.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4993)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
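As a side note, the task section of the log shows 'bpe_dir': None even though the value appears in model_overrides, and setup_task fails while joining it with "dict.txt". A quick check that the file the task would look for exists (illustrative only; the path is the bpe_dir used in the script above):

```bash
# Illustrative sanity check: setup_task joins cfg.bpe_dir with "dict.txt",
# so this is the file it would open when bpe_dir=../../utils/BPE.
ls -l ../../utils/BPE/dict.txt
```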
taokz commented 1 day ago

Hi, please remove 'bpe':'${bpe}' from --model-overrides. The BPE is already set in the script as utils/BPE, and the original script works as expected without modifications.
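For illustration, a sketch of the corrected command for the base checkpoint (same paths and flags as in the script above, with the 'bpe' key dropped so --model-overrides only carries data, bpe_dir, and selected_cols, matching the keys shown in the log):

```bash
# Sketch only: corrected invocation for the 'base' scale, assuming the same
# paths as in the script above. The 'bpe' key is removed from --model-overrides;
# bpe_dir still points at utils/BPE, as in the original evaluation script.
data=../../datasets/finetuning/gastrolab/gastrolab.tsv
bpe_dir=../../utils/BPE
selected_cols=0,1,2,3,4

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=1091 ../../evaluate.py \
    ${data} \
    --path=../../checkpoints/base/checkpoint.pt \
    --user-dir=../../module \
    --task=vqa_gen \
    --batch-size=16 \
    --gen-subset=test \
    --results-path=./results/gastrolab/base \
    --fp16 \
    --ema-eval \
    --beam-search-vqa-eval \
    --zero-shot \
    --beam=3 \
    --temperature=1.0 \
    --num-shards=1 \
    --distributed-world-size=1 \
    --model-overrides="{'data':'${data}','bpe_dir':'${bpe_dir}','selected_cols':'${selected_cols}'}"
```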