A problem about the "learned_owod_t2_ft.txt"

harrylin-hyl commented 1 year ago

Hi orr, I'm sorry to bother you again. I ran into a problem when generating the "learned_owod_t2_ft.txt" file. As shown in the first figure, the file lists some blank or erroneous image ids, and the total comes to 1749 instead of the 1743 predefined in the config file. However, when I generate "learned_owod_t2_ft.txt" with a single GPU, everything is fine, as shown in the figure below. My guess is that the problem lies in the "dist.all_gather_object" function, which combines the outputs of all GPUs, but I have no idea how to fix it.

Best regards, harrylin

orrzohar commented 1 year ago

Hi @harrylin-hyl,

That is fine! If you run into any reproducibility issues, please do not hesitate to reach out.

I have never run into this issue. From your description, I think I know where this is coming from, but I need to reproduce this myself to debug.

Could you please send me how many / which GPUs you are using?

Best, Orr

harrylin-hyl commented 1 year ago

Hi orr,

Thanks for your timely reply. I am using 4 A100 40G GPUs. The error occurs when I use all 4 GPUs and goes away when I use a single GPU.

Best, harrylin

orrzohar commented 1 year ago

Hi @harrylin-hyl, I have managed to reproduce this issue and found that storing the image names as str rather than int does the trick. I am pushing the revised repository into a side branch, 'exemplar-replay-write-fix'.
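The patch itself is not shown in this thread, so the following is only a rough sketch of the idea rather than the code in the 'exemplar-replay-write-fix' branch. It assumes each GPU contributes a list of image ids (in a real run these lists would come from something like `torch.distributed.all_gather_object`), converts every id to str, and drops blanks and repeats before writing. `merge_gathered_ids` is a hypothetical helper name, and the sample ids are made up for illustration.

```python
from typing import Iterable, List


def merge_gathered_ids(per_rank_ids: Iterable[Iterable[object]], max_length: int) -> List[str]:
    """Flatten per-rank id lists into one list of unique, non-blank str ids.

    Ids are kept in first-seen order and the result is capped at max_length.
    """
    seen = set()
    merged: List[str] = []
    for rank_ids in per_rank_ids:
        for raw in rank_ids:
            if raw is None:
                continue
            image_id = str(raw).strip()  # store names as str, never int
            if not image_id or image_id in seen:
                continue  # skip blank entries and ids already collected
            seen.add(image_id)
            merged.append(image_id)
    return merged[:max_length]


# Simulated gather result from two ranks (illustrative ids only).
gathered = [["2008_000001", "2008_000002", ""], ["2008_000002", 2008000003]]
print(merge_gathered_ids(gathered, max_length=3))
# prints ['2008_000001', '2008_000002', '2008000003']
```

Normalizing to str at the merge step means the written file does not depend on how each rank happened to represent an id.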

You can skip training and jump directly to collecting the exemplars by passing (via --pretrain) the weights of a model that was already trained for the same number of epochs as specified in the command, e.g.:

python -u main_open_world.py \
    --output_dir "${EXP_DIR}/t1" --dataset TOWOD --PREV_INTRODUCED_CLS 0 --CUR_INTRODUCED_CLS 20\
    --train_set 'owod_t1_train' --test_set 'owod_all_task_test' --epochs 41\
    --model_type 'prob' --obj_loss_coef 8e-4 --obj_temp 1.3\
    --wandb_name "${WANDB_NAME}_t1" --exemplar_replay_selection --exemplar_replay_max_length 850\
    --exemplar_replay_dir ${WANDB_NAME} --exemplar_replay_cur_file "learned_owod_t1_ft.txt"\
    --pretrain "exps/MOWODB/t1.pth" --lr 2e-5\
    ${PY_ARGS}

PY_ARGS=${@:1}
python -u main_open_world.py \
    --output_dir "${EXP_DIR}/t2" --dataset TOWOD --PREV_INTRODUCED_CLS 20 --CUR_INTRODUCED_CLS 20\
    --train_set 'owod_t2_train' --test_set 'owod_all_task_test' --epochs 51\
    --model_type 'prob' --obj_loss_coef 8e-4 --obj_temp 1.3 --freeze_prob_model\
    --wandb_name "${WANDB_NAME}_t2"\
    --exemplar_replay_selection --exemplar_replay_max_length 1743 --exemplar_replay_dir ${WANDB_NAME}\
    --exemplar_replay_prev_file "learned_owod_t1_ft.txt" --exemplar_replay_cur_file "learned_owod_t2_ft.txt"\
    --pretrain "exps/MOWODB/t2.pth"\
    ${PY_ARGS}

Would you mind verifying that this indeed solves the issue? If so, let me know and I will merge this side branch back into main.
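One way to run this check is a small script along the following lines. It is a sketch under assumptions not confirmed in this thread: that the exemplar file holds one image id per line, and that a correct file has exactly `--exemplar_replay_max_length` entries (1743 for t2, as in the report above). `check_exemplar_file` is a hypothetical helper, not part of the repository.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def check_exemplar_file(path: str, expected_count: int) -> Dict[str, object]:
    """Report line count, blank lines, and duplicate ids in an exemplar file.

    Assumes one image id per line (an assumption about the file format).
    """
    ids = [line.strip() for line in Path(path).read_text().splitlines()]
    blank_lines = sum(1 for image_id in ids if not image_id)
    counts = Counter(image_id for image_id in ids if image_id)
    duplicates = sorted(image_id for image_id, n in counts.items() if n > 1)
    return {
        "total_lines": len(ids),
        "blank_lines": blank_lines,
        "duplicates": duplicates,
        "ok": len(ids) == expected_count and blank_lines == 0 and not duplicates,
    }
```

For example, `check_exemplar_file("learned_owod_t2_ft.txt", 1743)` would be called with the actual path to the file under the exemplar replay directory; the path shown here is a placeholder.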

Best, Orr

harrylin-hyl commented 1 year ago

Hi orr,

I have tested this solution and the bug no longer occurs. Thank you very much!

Best, harrylin

orrzohar commented 1 year ago

Hi @harrylin-hyl,

Great! I will now merge this fix into the main branch.

Thank you for pointing out this issue, and please do not hesitate to point out future issues you encounter with the codebase!

Best, Orr

orrzohar / PROB

A problem about the "learned_owod_t2_ft.txt" #13