mlweilert closed this issue 5 years ago
Hm. That's strange. It could be that one of the worker processes crashed. Does it happen even if you set num-workers to 1? Could you also check whether you have pytorch installed?
Good call! Setting num-workers to 1 fixed it! Any idea why?
Somehow the parallel workers got stuck. Do you have pytorch installed or not?
Yes, version 1.2.0, build py3.6_cuda10.0.130_cudnn7.6.2_0. Should I try updating it?
You could try removing pytorch from the environment; a numpy-based data loader will then be used instead. It's due to the following DataLoader issue in pytorch: https://github.com/pytorch/pytorch/issues/1355
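For reference, the standard workaround discussed in that pytorch issue is to disable multiprocessing workers on the DataLoader side. A minimal sketch (assuming PyTorch is installed; the toy dataset here is just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training data.
dataset = TensorDataset(torch.arange(100).float().unsqueeze(1))

# num_workers=0 loads batches in the main process, which sidesteps
# the fork-related worker deadlock from pytorch/pytorch#1355.
loader = DataLoader(dataset, batch_size=10, num_workers=0)

n_batches = sum(1 for _ in loader)
print(n_batches)
```

With 100 samples and a batch size of 10 this yields 10 batches; the point is only that iteration completes in the main process instead of hanging in forked workers.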
I uninstalled pytorch from the environment and it worked nicely. Thanks for helping me trace the issue! I will make sure to set up the environment based on these discussions in the AWS AMI. After the AMI is created, I'll push a change to the README.md with the public link to the AMI.
When training BPNet, if you (1) do not specify `--in-memory` and also (2) pass a `--config` input that differs from the premade `bpnet9` config file, the data does not load properly and the process freezes right before model training begins. Everything loads correctly up to that point, and CPU and GPU usage then drop to zero.