nocaps-org / updown-baseline

Baseline model for nocaps benchmark, ICCV 2019 paper "nocaps: novel object captioning at scale".
https://nocaps.org
MIT License

Problem arises when training with the updown_nocaps setting #9

Closed: chenxy99 closed this issue 4 years ago

chenxy99 commented 4 years ago

Hi, thanks a lot for this great dataset.

First, I followed the instructions in 'How to setup this codebase?' to set up my Anaconda environment (updown) and to set my token through the EvalAI CLI.

Next, following the instructions in 'How to train your captioner?', I ran this script:

```shell
python scripts/train.py --config configs/updown_nocaps_val.yaml \
    --config-override OPTIM.BATCH_SIZE 250 \
    --checkpoint-every 10 \
    --gpu-ids 0 \
    --serialization-dir checkpoints/updown-baseline
```

I find that the script runs through to the validation part of scripts/train.py and prints the first 25 captions with their image IDs (train.py, lines 261-263):

```python
# Print first 25 captions with their Image ID.
for k in range(25):
    print(predictions[k]["image_id"], predictions[k]["caption"])
```

Then, however, the script seems to stop running, as shown in the screenshot below (image_cap).

I waited for a long time (at least half an hour) and then got an error message (screenshot: image_error).

I suspected there might be some problem with evalai, but `conda list` shows that evalai is installed. Another possibility: when I ran `pip install -r requirements.txt`, I got a message that some packages have incompatible versions. I have tried a lot of combinations, but I cannot find one where all of the packages are compatible (screenshot: conda_install).

So I hope you can provide an environment.yaml (from `conda env export > environment.yaml`) so that I can try the exact same versions of all the packages.

If that is not the cause, could you help me solve this problem?

Thanks a lot again.

kdexd commented 4 years ago

Hi @chenxy99, glad you liked our work! It looks like your predictions are not being uploaded to EvalAI for validation. Can you double-check whether your compute machine has internet access? Also, please try saving one prediction file as JSON (and removing the --evalai-submit flag), then go to evalai.cloudcv.org and submit it manually. Let me know if EvalAI does not accept your file, thanks!
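For saving the file, something along these lines should work (a rough sketch, not the exact code in this repo; the file name and the example entry are placeholders, and `predictions` stands in for the list of dicts built in the validation loop of scripts/train.py):

```python
import json

# Rough sketch: dump the list of {"image_id": ..., "caption": ...} dicts
# built during validation to a JSON file, then upload that file manually
# at evalai.cloudcv.org.
predictions = [
    {"image_id": 42, "caption": "a dog sitting on a couch"},  # placeholder
]
with open("predictions_val.json", "w") as f:
    json.dump(predictions, f)
```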

chenxy99 commented 4 years ago

Thanks for your help. It works well.

kdexd commented 4 years ago

That's great, glad it worked!

chenxy99 commented 4 years ago

Hello, I find that your code works well in the single-GPU scenario, but in the multi-GPU setting there seems to be a problem. I used the script below:

```shell
python scripts/train.py --config configs/updown_plus_cbs_nocaps_val.yaml \
    --config-override OPTIM.BATCH_SIZE 250 \
    --checkpoint-every 10 \
    --gpu-ids 0 1 \
    --serialization-dir checkpoints/updown_plus_cbs_test
```

During the second evaluation on the nocaps val split, an error occurred:

```
  0%|          | 19/70000 [07:59<152:06:17, 7.82s/it]
Traceback (most recent call last):          | 324/750 [01:36<02:04, 3.42it/s]
  File "scripts/train.py", line 239, in <module>
    num_constraints=batch.get("num_constraints", None),
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa99d5c7441 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fa99d5c6d7a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef, long, c10::optional) + 0x55a (0x7fa99c95138a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a230c (0x7fa9dcd3830c in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130cfc (0x7fa9dc8c6cfc in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #15: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7fa9dcb49481 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
```

It seems there is some problem with the outputs of CBS:

```
RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)
```

I would like to know how I can fix this problem.
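My guess is that with CBS decoding, each GPU replica can return captions of different lengths (here one replica returned shape [3, 4] while another returned [3, 20]), so `nn.DataParallel` cannot gather them. A workaround I am considering is to pad every replica's output to a fixed maximum length before it is gathered; a minimal sketch, assuming the model's forward returns a `(batch_size, length)` tensor of token ids (the helper name and max length below are just placeholders):

```python
import torch

def pad_predictions(predictions: torch.Tensor, max_len: int = 20) -> torch.Tensor:
    """Hypothetical helper (not in the repo): right-pad a (batch, length)
    tensor of token ids with zeros up to max_len so that every GPU replica
    returns the same shape and nn.DataParallel can gather the outputs."""
    batch_size, length = predictions.size()
    if length >= max_len:
        return predictions[:, :max_len]
    padding = predictions.new_zeros(batch_size, max_len - length)
    return torch.cat([predictions, padding], dim=1)

# Example: a replica that decoded only 4 steps is padded to 20 columns,
# matching the expected [3, 20] size from the error message.
short = torch.ones(3, 4, dtype=torch.long)
print(pad_predictions(short).shape)  # torch.Size([3, 20])
```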

Thanks a lot again for your help.

chenxy99 commented 4 years ago

Hello @kdexd, since this morning I find that every time I submit my evaluation JSON file to EvalAI, the status stays at 'submitted'. I have waited for about 4 hours, but it never reaches the 'finished' status (it usually takes about 1 minute to change to 'finished'). Is there something wrong with EvalAI for nocaps since this morning? I hope you can help me solve this issue.

Thanks a lot.