junsukchoe / ADL

Attention-based Dropout Layer for Weakly Supervised Object Localization (CVPR 2019 Oral)
MIT License

pytorch code is not clear #2

Closed: GuoleiSun closed this issue 5 years ago

GuoleiSun commented 5 years ago

Hi,

Great paper, and thanks for providing the PyTorch code. I tried the code, but it is written for a distributed setup. Could you clean it up a little so that it can be run easily on any system with a few GPUs? Also, there are some errors in the code; could you correct them?

Thanks

junsukchoe commented 5 years ago

Hi Guolei,

We are sorry for the inconvenience. A mistake was introduced when I cleaned up the code. We have revised the code to correct the errors; if any errors remain, please let us know.

Although our code uses modules for distributed training, you can also run it on a single system. We usually run it on a single PC with 1 or 2 GPU(s).

Thanks

GuoleiSun commented 5 years ago

Hi, I tried to use one GPU and set args.multiprocessing_distributed to False, but I got the following error. I use Python 3.7 (>3.3) and my PyTorch version is 1.1.0 (checked with python -c "import torch; print(torch.__version__)").

File "train.py", line 354, in train for name, module in model.module.named_modules(): File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in getattr type(self).name, name)) AttributeError: 'VGG' object has no attribute 'module'

GuoleiSun commented 5 years ago

Hi,

I solved the problem by removing .module in model.module.named_modules().
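
For reference, here is a minimal sketch of the kind of guard I mean, assuming the model may or may not be wrapped in DataParallel / DistributedDataParallel (the unwrap helper is just an illustrative name, not code from this repository):

import torch.nn as nn

def unwrap(model):
    # DataParallel / DistributedDataParallel keep the real network in .module;
    # a plain nn.Module has no such attribute.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# works for both a wrapped and an unwrapped model
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
for name, module in unwrap(model).named_modules():
    print(name, type(module).__name__)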

But how do I use 2 or more GPUs to train the model? Which arguments should I use? Could you provide a script like run1.sh for single-GPU and multi-GPU training?

Thanks a lot

junsukchoe commented 5 years ago

Hi Guolei,

You do not need to set args.multiprocessing_distributed to False. The only thing you need to change is the GPU list.

Here are the examples:

for 1 GPU:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"

CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}

for 2 GPUs:

gpu=0,1
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"

CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}

GuoleiSun commented 5 years ago

OK. When I run "bash scripts/run1.sh", I get the following error:

Traceback (most recent call last): File "train.py", line 387, in main() File "train.py", line 70, in main mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn while not spawn_context.join(): File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join raise Exception(msg) Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/raid/guolei/wsol/ADL-master/Pytorch/train.py", line 102, in main_worker
    world_size=args.world_size, rank=args.rank)
  File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Permission denied

GuoleiSun commented 5 years ago

I didn't change anything. In run1.sh, you use one GPU, right?

junsukchoe commented 5 years ago

Yes, that's right. On our system, the code runs without any error when we use only one GPU.

Did you change the GPU list? If your system has only one GPU, you may need to change the first line of run1.sh to gpu=0.
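
If you are not sure how many GPUs are actually visible, a quick diagnostic (independent of this repository) is to check the device count with PyTorch itself:

python -c "import torch; print(torch.cuda.device_count())"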

GuoleiSun commented 5 years ago

Actually, my system has 8 GPUs. I tried different GPU IDs, and all of them give the same error as above. Could you check the reason?

junsukchoe commented 5 years ago

Sure!

Could you change the port number? Port 01 may be blocked on your system. You can use another port, e.g., 8889 or 8890.

For example:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:8889
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"

CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}
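
If another port is also rejected, one quick way to check whether a port can be bound before passing it to --dist-url is a small socket test (just a diagnostic sketch, not part of the training code):

import socket

# try to bind the candidate rendezvous port on localhost before using it in --dist-url
port = 8889
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("127.0.0.1", port))
    print("port", port, "is available")
except OSError as e:
    print("port", port, "cannot be used:", e)
finally:
    s.close()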

GuoleiSun commented 5 years ago

I changed the port to 8888 and it works. Great! Thanks a lot. I will run it and let you know what results I get.

junsukchoe commented 5 years ago

I'm glad to hear that!

Please note that the results from this PyTorch implementation can be slightly different from the results in the paper. We used the TensorFlow implementation for all experiments, as mentioned in the paper.

If you have any further questions, just let us know.

GuoleiSun commented 5 years ago

I understand. Thanks

ahmdtaha commented 5 years ago

I am sorry to open this again, but I noticed that some of your configuration parameters differ from the ones in the README file and utils_args. I am using the TensorFlow version.

For VGG & CUB

I am trying to replicate the VGG results on CUB and I am getting different results, so I am trying to figure out what I am doing wrong.

junsukchoe commented 5 years ago

@ahmdtaha

Thanks for your comment.

Recently I noticed that the TensorFlow code in this repository is slightly different from the submission version. This is because I cleaned up the code while working on improving classification accuracy with ADL (one of our future plans, as mentioned in the paper). In addition, the training settings differ between the PyTorch and TensorFlow versions (the PyTorch implementation is still not stable). I am sorry for the inconvenience. I will revise the code soon, but unfortunately I do not have the resources to test it right now. I can probably test it after the CVPR 2020 deadline.

In the meantime, you can use this command to reproduce our results:

python CAM-VGG.py --gpu 0 --data /notebooks/dataset/CUB200/ --cub --base-lr 0.01 --logdir VGGGAP_CUB --load VGG --batch 128 --attdrop 3 4 53 --threshold 0.80 --keep_prob 0.25

ahmdtaha commented 5 years ago

Thanks for your reply