huishao007 commented 2 weeks ago

After setting up the environment as instructed, I ran into a problem reproducing the VQA task: running evaluate_vqa_rad_beam_scale.sh fails with the error shown below. I would appreciate your help. Also, your article says you used CUDA 12.2, but biomedgpt.yml on GitHub specifies CUDA 11. Is the version on GitHub not your latest? As a result, I ran into some problems reproducing the downstream tasks you provided.

My bash command is as follows:

export MASTER_PORT=8082
user_dir=../../module
bpe_dir=../../utils/BPE
split=$1
data_dir=../../datasets/finetuning/vqa-rad
data=${data_dir}/test.tsv
ans2label_file=${data_dir}/trainval_ans2label.pkl

declare -a Scale=('tiny')

for scale in ${Scale[@]}; do
if [[ $scale =~ "tiny" ]]; then
    patch_image_size=256
elif [[ $scale =~ "medium" ]]; then
    patch_image_size=256
elif [[ $scale =~ "base" ]]; then
    patch_image_size=384
fi

path=../../checkpoint/biomedgpt_tiny.pt
result_path=./results/vqa_rad_beam/${scale}
mkdir -p $result_path
selected_cols=0,5,2,3,4

log_file=${result_path}/${scale}".log"

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
    ${data} \
    --path=${path} \
    --user-dir=${user_dir} \
    --task=vqa_gen \
    --batch-size=64 \
    --log-format=simple --log-interval=100 \
    --seed=7 \
    --gen-subset=${split} \
    --results-path=${result_path} \
    --fp16 \
    --ema-eval \
    --beam-search-vqa-eval \
    --beam=1 \
    --unnormalized \
    --temperature=1.0 \
    --num-workers=0 \
    --model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"selected_cols\":\"${selected_cols}\",\"ans2label_file\":\"${ans2label_file}\"}" > ${log_file} 2>&1

api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../evaluate.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_15:23:58
  host      : h3c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2983314)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
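The launcher summary above only reports that the child process exited with code 1. Because the script redirects both stdout and stderr into ${log_file}, any Python traceback printed by evaluate.py itself would normally sit earlier in that same log. A minimal sketch for pulling it out, assuming the log contains a standard Python traceback; `last_traceback` is a hypothetical helper name, not part of BiomedGPT:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the log from the last Python traceback to the end.
last_traceback() {
  local log="$1" start
  # Line number of the last traceback header in the log, if any.
  start=$(grep -n 'Traceback (most recent call last)' "$log" | tail -n 1 | cut -d: -f1)
  if [[ -z "$start" ]]; then
    echo "no Python traceback found in $log" >&2
    return 1
  fi
  # Print everything from that line onward.
  tail -n "+${start}" "$log"
}
```

With the variables from the script, this would be called as `last_traceback "${log_file}"` after the run fails.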

taokz commented 2 weeks ago

The initial results were obtained using CUDA 11, as specified in the biomedgpt.yml file. Later, due to an automatic server upgrade, CUDA was updated to version 12. However, I continued running the code, and both CUDA 11 and CUDA 12 have been verified to work with the current implementation.

Regarding the error you encountered, it is due to the absence of the self.cfg.neg_sample_dir and type2ans.json files. Please refer to the previous response in Issue #10 for further details.
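Missing input files like these can be caught before the launcher starts. The following is a minimal sketch, not part of the BiomedGPT scripts: `require_files` is a hypothetical helper, and the paths passed to it are assumptions based on the variables in the script posted above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report every missing path on stderr and
# return non-zero if at least one of them does not exist.
require_files() {
  local missing=0 f
  for f in "$@"; do
    if [[ ! -e "$f" ]]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In the evaluation script this could be called as `require_files "${data}" "${ans2label_file}" "${path}" || exit 1` just before the python3 command. The locations of type2ans.json and the directory behind neg_sample_dir are not given in this thread, so those paths would have to be added by hand.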

taokz commented 6 days ago

@huishao007 Has your issue been resolved? I think you might be using the pre-trained checkpoint instead of the fine-tuned one for vqa prediction. Using the pre-trained checkpoint may lead to the issue you’re experiencing. You can try setting the --zero-shot flag in the script, as mentioned in issue #19 as well.

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
                ${data} \
                --path=${path} \
                --user-dir=${user_dir} \
                --task=vqa_gen \
                --selected-cols=${selected_cols} \
                --bpe-dir=${bpe_dir} \
                --patch-image-size=480 \
                --prompt-type='none' \
                --batch-size=8 \
                --log-format=simple --log-interval=10 \
                --seed=7 \
                --gen-subset=${split} \
                --results-path=${result_path} \
                --fp16 \
                --zero-shot \
                --beam=${beam_size} \
                --unnormalized \
                --temperature=1.0 \
                --num-workers=0 \
                > ${log_file} 2>&1
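If the same script is used for both pre-trained and fine-tuned checkpoints, the flag can be added conditionally instead of being edited in and out. A minimal sketch, assuming a hypothetical `zero_shot` switch that you set yourself; `build_extra_args` and `extra_args` are illustrative names and do not exist in the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: fill the global array extra_args with --zero-shot
# only when the first argument is "1" (i.e. a pre-trained checkpoint is used).
build_extra_args() {
  local zero_shot="$1"
  extra_args=()
  if [[ "$zero_shot" == "1" ]]; then
    extra_args+=(--zero-shot)
  fi
}
```

After calling `build_extra_args 1`, the array can be expanded as `"${extra_args[@]}"` among the other options of the evaluate.py command; with `build_extra_args 0` it expands to nothing.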

taokz / BiomedGPT

I encountered a problem when reproducing the VQA task #32
