microsoft / TAP

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)
MIT License

Error when Pretraining on a User-Defined Dataset #19

Open kangzhao2 opened 2 years ago

kangzhao2 commented 2 years ago

Hi:

I want to use TAP to pretrain a model on my own dataset, and I have prepared the dataset following your data format.

But when I try to pretrain the model in a distributed setting (using only one GPU works fine), I encounter the following error:

```
2022-04-15T14:13:50 INFO: m4c_textvqa:, 73100/96000, train/total_loss: 1.6139 (2.9855), train/m4c_textvqa/pretrainonly_m4c_decoding_bce_with_mask: 1.6139 (2.9855), train/m4c_textvqa/maskpred_accuracy: 0.8486 (0.7797), val/total_loss: 4.3474, val/m4c_textvqa/pretrainonly_m4c_decoding_bce_with_mask: 4.3474 (4.3474), val/m4c_textvqa/maskpred_accuracy: 0.7328, max mem: 7456.0, lr: 0.00001, time: 02m 47s 324ms, eta: 10h 43m 43s 839ms
2022-04-15T14:13:50 INFO: Batch Size of one GPU:16
2022-04-15T14:14:40 ERROR: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7ff58f8d1193 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x7ff5dae6ff81 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa0f14a (0x7ff5dae5c14a in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x2961c4 (0x7ff5da6e31c4 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyCFunction_FastCallDict + 0x262 (0x56330c484562 in /home/pai/envs/vqa/bin/python)
frame #5: <unknown function> + 0x183135 (0x56330c4b0135 in /home/pai/envs/vqa/bin/python)
...
```

The training loss drops as expected, but after many iterations (73,100 in the case above) the error occurs. This is very strange, since this kind of error should normally show up before training starts.
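For reference, here is a minimal diagnostic sketch (the helper name is mine, not from the TAP code) that lists trainable parameters which received no gradient after a backward pass; parameters that are only used on some iterations are a common trigger for this DDP error:

```python
import torch

def report_unused_parameters(model: torch.nn.Module) -> None:
    # Illustrative sketch only: call this right after the first loss.backward() and
    # before optimizer.step(). Any trainable parameter whose .grad is still None did
    # not take part in producing the loss on that iteration.
    unused = [name for name, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    print(f"{len(unused)} trainable parameters received no gradient")
    for name in unused:
        print("  ", name)
```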

Have you ever encountered this problem? Or could you help me solve it?

Thanks very much.

Kang

zyang-ur commented 2 years ago

Hi @kangzhao2 ,

I have not observed this problem in the TAP repo. However, it seems to be a known issue with distributed data parallel training in PyTorch. The cause, as the error message indicates, is usually that not all forward outputs participate in calculating the loss. That does not explain why the error only appears in the middle of training, though. Please feel free to share more information and we can discuss it further. Thank you.
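As a temporary workaround, the option mentioned in the error message can be enabled where the model is wrapped for distributed training. A minimal sketch, assuming a standard PyTorch DDP setup (the function name and `local_rank` are placeholders, not names from the TAP code):

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> DistributedDataParallel:
    # Sketch only: `model` and `local_rank` stand in for whatever the actual
    # training script uses; the point is the extra keyword argument.
    return DistributedDataParallel(
        model.to(local_rank),
        device_ids=[local_rank],
        output_device=local_rank,
        # Workaround suggested by the error message: DDP detects parameters that
        # did not participate in the current forward pass and skips waiting for
        # their gradients, at some performance cost.
        find_unused_parameters=True,
    )
```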

kangzhao2 commented 2 years ago

Dear zyang-ur:

It may take a long time to solve this problem. But if you release the full VizWiz features, I may be able to bypass it.

I can only find a small part of the VizWiz data in "data/ocr_feat_resx/stvqa_conf/vizwiz" and "data/feat_resx/stvqa/train/vizwiz".

Kang

zyang-ur commented 2 years ago

Thank you for the info.

Unfortunately, we didn't experiment on VizWiz and do not have the features ready (the subset you found is from ST-VQA). But does this suggest that the error is related to the dataset (i.e., that training works on other datasets)? That seems strange.

kangzhao2 commented 2 years ago

Could you share the feature extractor you used for your datasets, so I can extract the VizWiz features myself?

Kang


zyang-ur commented 2 years ago

Hi Kang,

For the BUTD features, we extract them with the tool provided by M4C here: https://github.com/facebookresearch/mmf/tree/main/projects/m4c/scripts, which is based on Detectron.

For the VinVL features, we use the VinVL repo: https://github.com/microsoft/Oscar. Quote: "04/13/2021: Our Scene Graph Benchmark Repo has been released. Welcome to use the code there to extract image features with VinVL pretrained models."

Thank you.