Closed: KechenQin closed this issue 4 years ago.
@KechenQin You can try to reduce NUM_WORKERS_PER_GPU in the yaml file.
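For reference, a minimal sketch of the kind of change meant here, assuming NUM_WORKERS_PER_GPU appears as a key in cfgs/refcoco/base_detected_regions_4x16G.yaml (its exact placement in the file may differ, and 1 is just an example value):

```yaml
# Sketch of cfgs/refcoco/base_detected_regions_4x16G.yaml (key placement may differ)
NUM_WORKERS_PER_GPU: 1   # reduced from 4; fewer DataLoader worker processes per GPU
```

Fewer worker processes per GPU lowers the shared-memory and host-RAM footprint of the PyTorch DataLoader, which is a common source of crashes in memory-constrained containers or VMs.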
Thanks for the reply!
I reduced NUM_WORKERS from 4 to 1, but I still got the same error. Basically, the error occurs when I call loss.backward(). I am working with a Tesla V100 GPU (16 GB). Please let me know if you have any other ideas.
Could you provide more details about your environment, including system version, CUDA version, Python version, PyTorch version, etc.? How many V100 GPUs are you using to run the code? And which config YAML are you using?
I am working on Linux in a conda virtual environment, with CUDA 9.0 and Python 3.6.5. I have 8 GPUs in total, but I only tested VL-BERT with one GPU. I also tried using 4 GPUs following the default setup and got the same error. I am using cfgs/refcoco/base_detected_regions_4x16G.yaml as the config file.
By the way, I did not install TensorFlow in this environment and did not see any dependency errors. I am not sure whether that is the cause of this issue.
I got the problem solved after switching to a different AWS AMI.
Thank you for the code. I want to fine-tune this model on the RefCOCO dataset. I get a segmentation fault when I run the non-distributed shell script. Please help.
[Partial Load] non pretrain keys: ['final_mlp.2.weight', 'final_mlp.2.bias']
PROGRESS: 0.00%
Segmentation fault
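For anyone debugging the same crash, a minimal sketch of one way to get a Python-level traceback out of a segmentation fault, using the standard-library faulthandler module (this is a generic suggestion rather than something used in this thread; the training entry point below is a placeholder, not the repo's actual script name):

```python
# Option 1: enable faulthandler from the command line when launching training:
#   python -X faulthandler <training_script.py> <args>
#
# Option 2: enable it programmatically near the top of the entry point.
import faulthandler

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, dump the Python traceback of all threads to stderr
# instead of exiting silently with "Segmentation fault".
faulthandler.enable()
```

Even if the crash originates in a C/CUDA extension, the dumped Python frames narrow down which call (data loading, forward pass, loss.backward(), etc.) triggered it.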