nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Regarding reproducing the qualitative results of NR3D and ScanRefer #1

Closed Daniellli closed 1 year ago

Daniellli commented 2 years ago

I cannot reproduce the qualitative results of NR3D and ScanRefer, no matter how many epochs I train.

ayushjain1144 commented 2 years ago

Hi Daniellli, I suppose you mean the quantitative numbers on NR3D and ScanRefer:

  1. Could you try an inference with the pre-trained checkpoint and see if you can reproduce the results? That would make sure everything is set up correctly. (A rough command sketch follows this list.)
  2. Could you share the exact command you ran (batch size, etc.) and the results you are obtaining? Logs would be very helpful if possible!
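
For reference, an inference-only run might look roughly like the sketch below. The --eval and --checkpoint_path flag names are assumptions here, so verify them against train_dist_mod.py; the remaining flags mirror the det training scripts in this thread, and PATH/TO/pretrained_checkpoint.pth stands in for the released checkpoint.

# Hypothetical evaluation-only run with the released NR3D checkpoint (single GPU).
# --eval and --checkpoint_path are assumed flag names; check train_dist_mod.py for the exact ones.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port 29500 \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --data_root datasets/ \
    --dataset nr3d --test_dataset nr3d \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --pp_checkpoint datasets/gf_detector_l6o256.pth \
    --butd --self_attend \
    --checkpoint_path PATH/TO/pretrained_checkpoint.pth \
    --eval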

One potential issue could be that you might be running with --lr_decay_epochs 25 26, which reduces the learning rate after the 25th epoch and then again at the 26th. This is fine for SR3D, but for NR3D and ScanRefer we need to train for longer before reducing the learning rate. Maybe you can try deleting this flag from your run script, train the model until the validation accuracy starts dropping, and then manually decrease the learning rate. Let us know how it goes!! (We can then add something to the README so that others don't encounter this issue.) Thank you!
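
To make the manual-decay suggestion concrete, here is a rough two-step sketch, not an exact recipe: first run your existing training script with the --lr_decay_epochs line removed, then, once validation accuracy plateaus, relaunch from the latest checkpoint with smaller learning rates. The --checkpoint_path flag name and the 10x decay factor are assumptions; --reduce_lr is the flag the authors mention later in this thread for reloading optimizers correctly.

#* Step 1: run your training script with the "--lr_decay_epochs 25 26" line removed and monitor validation accuracy.
#* Step 2 (sketch): resume from the latest checkpoint with manually reduced learning rates.
#* Assumptions: --checkpoint_path is the resume flag and 10x is the decay factor; verify both in train_dist_mod.py.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port 29500 \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color --weight_decay 0.0005 \
    --data_root datasets/ \
    --val_freq 5 --batch_size 12 --save_freq 5 --print_freq 5 \
    --lr_backbone=1e-4 --lr=1e-5 \
    --dataset nr3d --test_dataset nr3d \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --pp_checkpoint datasets/gf_detector_l6o256.pth \
    --butd --self_attend --augment_det \
    --max_epoch 400 \
    --checkpoint_path ./logs/bdetr/CHECKPOINT_TO_RESUME.pth \
    --reduce_lr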

Daniellli commented 2 years ago

1. I can reproduce the results with the pre-trained checkpoint.

2. I use the following shell scripts to train on NR3D and ScanRefer; the results are listed below:

Det

train_data=nr3d;
test_data=nr3d;
DATA_ROOT=datasets/
gpu_ids="1,4,5,6"
gpu_num=4
b_size=12
port=29526
#* train
CUDA_VISIBLE_DEVICES=$gpu_ids python -m torch.distributed.launch --nproc_per_node $gpu_num --master_port $port \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --weight_decay 0.0005 \
    --data_root $DATA_ROOT \
    --val_freq 5 --batch_size $b_size --save_freq 5 --print_freq 5 \
    --lr_backbone=1e-3 --lr=1e-4 \
    --dataset $train_data --test_dataset $test_data \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --lr_decay_epochs 25 26 \
    --pp_checkpoint $DATA_ROOT/gf_detector_l6o256.pth \
    --butd --self_attend --augment_det \
    --max_epoch 400 \
    2>&1 | tee -a logs/train.log

train_data=scanrefer;
test_data=scanrefer;
DATA_ROOT=datasets/
gpu_ids="2,3,4,5,6,7"
gpu_num=6
b_size=12
port=29526

save_interval=1
#* train
CUDA_VISIBLE_DEVICES=$gpu_ids python -m torch.distributed.launch --nproc_per_node $gpu_num --master_port $port \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --weight_decay 0.0005 \
    --data_root $DATA_ROOT \
    --val_freq $save_interval --batch_size $b_size --save_freq $save_interval --print_freq $save_interval \
    --lr_backbone=1e-3 --lr=1e-4 \
    --dataset $train_data --test_dataset $test_data \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --lr_decay_epochs 25 26 \
    --pp_checkpoint $DATA_ROOT/gf_detector_l6o256.pth \
    --butd --self_attend --augment_det \
    --max_epoch 400 \
    --upload-wandb \
    2>&1 | tee -a logs/train.log

| Dataset   | Acc@0.25 | Acc@0.50 |
|-----------|----------|----------|
| NR3D      | 0.3747   | 0.2565   |
| ScanRefer | 0.4760   | 0.3324   |

Cls


train_data=nr3d
test_data=nr3d
DATA_ROOT=datasets/
gpu_ids="0,1,2,3,4,5,6,7"
gpu_num=8
b_size=8
port=29522

TORCH_DISTRIBUTED_DEBUG=INFO CUDA_VISIBLE_DEVICES=$gpu_ids python -m torch.distributed.launch --nproc_per_node $gpu_num --master_port $port \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --weight_decay 0.0005 \
    --data_root $DATA_ROOT \
    --val_freq 5 --batch_size $b_size --save_freq 5 --print_freq 10 \
    --lr_backbone=1e-3 --lr=1e-4 \
    --dataset $train_data --test_dataset $test_data \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --lr_decay_epochs 25 26 \
    --pp_checkpoint $DATA_ROOT/gf_detector_l6o256.pth \
    --butd_cls --self_attend \
    --max_epoch 400 \
    --upload-wandb \
    2>&1 | tee -a logs/train_test_cls.log

| Dataset | Acc    |
|---------|--------|
| NR3D    | 0.3873 |

Daniellli commented 2 years ago

Thank you for your reply. I will now try deleting --lr_decay_epochs 25 26 and retraining.

ayushjain1144 commented 2 years ago

Great, thanks for the detailed reply! Pretty sure removing --lr_decay_epochs 25 26 should fix it. We will re-train on our end too in the coming days and update the README with the precise lr_decay_epochs for NR3D and ScanRefer.

ayushjain1144 commented 1 year ago

Hi Daniellli, we added some additional instructions to the README to further clarify the issue you raised, including reference numbers of epochs you might need to train for. Let us know if you still face issues reproducing the numbers and we would be happy to help :)

Daniellli commented 1 year ago

Appreciate your help.🌹

Hiusam commented 1 year ago

> Hi Daniellli, we added some additional instructions to the README to further clarify the issue you raised, including reference numbers of epochs you might need to train for. Let us know if you still face issues reproducing the numbers and we would be happy to help :)

Hi, what are the final training scripts to reproduce NR3D? I cannot reproduce the results either; my accuracy is about 37%.

ayushjain1144 commented 1 year ago

Hi,

Here is the relevant portion from the README:

> On NR3D and ScanRefer we need much more training epochs to converge. It's better to monitor the validation accuracy and decrease learning rate accordingly. For example, in det setup, we decrease lr at epochs 80 and 90 for NR3D and at epoch 65 for Scanrefer. To disable automatic learning rate decay, you can remove --lr_decay_epochs from the train script and manually decrease the learning rate when the validation accuracy converges. Be sure to add --reduce_lr flag when decreasing learning rate and continuing from a checkpoint to load optimizers correctly.

The following should work for you:

TORCH_DISTRIBUTED_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port $RANDOM \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --weight_decay 0.0005 \
    --data_root DATA_ROOT \
    --val_freq 5 --batch_size 24 --save_freq 5 --print_freq 1000 \
    --lr_backbone=1e-3 --lr=1e-4 \
    --dataset nr3d --test_dataset nr3d \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --lr_decay_epochs 80 90 \
    --pp_checkpoint PATH/TO/gf_detector_l6o256.pth \
    --butd --self_attend --augment_det
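
For ScanRefer, the same recipe should translate directly per the README excerpt above: switch the dataset flags and move the decay milestone to epoch 65. This adaptation is a sketch, assuming --lr_decay_epochs also accepts a single milestone; everything else mirrors the NR3D command.

# Sketch of the ScanRefer variant: only --dataset/--test_dataset and --lr_decay_epochs change.
TORCH_DISTRIBUTED_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port $RANDOM \
    train_dist_mod.py --num_decoder_layers 6 \
    --use_color \
    --weight_decay 0.0005 \
    --data_root DATA_ROOT \
    --val_freq 5 --batch_size 24 --save_freq 5 --print_freq 1000 \
    --lr_backbone=1e-3 --lr=1e-4 \
    --dataset scanrefer --test_dataset scanrefer \
    --detect_intermediate --joint_det \
    --use_soft_token_loss --use_contrastive_align \
    --log_dir ./logs/bdetr \
    --lr_decay_epochs 65 \
    --pp_checkpoint PATH/TO/gf_detector_l6o256.pth \
    --butd --self_attend --augment_det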

Just so you know, you might need to adjust hyperparameters if you change the effective batch size. In that case, I would suggest manually decreasing the learning rate when validation performance starts to saturate. If you still cannot reproduce our results, please ping us back with more details of what you tried and we would be happy to help.

Hiusam commented 1 year ago

Thanks. I am trying your setting.