Closed hasaikeyQAQ closed 1 year ago
I've responded in issue 339. I'm sorry, but we have not experimented with the Slurm job management system; we use MPI instead. It could therefore be hard for me to help you debug this case.
Dear Haotian,
I have received your reply and thank you for your help.
Best regards
Sounds good! I'm closing this issue since it is resolved.
I followed the training instructions in readme.md, using this command:
torchpack dist-run -np 2 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
The output appears to be stuck. I have tried changing the Shapely library version to 1.8.0 and adding the environment variable CUDA_LAUNCH_BLOCKING=1 to force CUDA synchronization, but neither has helped. (My current Shapely version is 1.8.5.)
I think this issue may be related to the following factors: my Torchpack version is 0.3.1, which may be incompatible. In addition, I am using the Slurm job management system, which may cause the process to be killed.
Do you have any other suggestions or solutions? If you need more information, please let me know. Thank you for your help!
Originally posted by @hasaikeyQAQ in https://github.com/mit-han-lab/bevfusion/issues/339#issuecomment-1488120007
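For anyone reproducing a hang like this, it may help to attach the exact package versions and launch environment to the report. The following is only a minimal sketch, not part of the bevfusion repository: the helper name `collect_environment_report` is hypothetical, and it assumes the packages are installed under the distribution names `shapely`, `torchpack`, and `torch`. It reads `SLURM_JOB_ID`, which Slurm sets inside a job allocation, to show whether the process was started under Slurm.

```python
import os
from importlib import metadata


def collect_environment_report(packages=("shapely", "torchpack", "torch")):
    """Return installed versions and a few environment variables as a dict.

    Hypothetical helper for bug reports; the package names are assumptions.
    """
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            report[name] = None
    # None means the variable is not set in this process.
    report["CUDA_LAUNCH_BLOCKING"] = os.environ.get("CUDA_LAUNCH_BLOCKING")
    report["SLURM_JOB_ID"] = os.environ.get("SLURM_JOB_ID")
    return report


if __name__ == "__main__":
    for key, value in collect_environment_report().items():
        print(f"{key}: {value}")
```

Running this inside the same job that launches training shows whether CUDA_LAUNCH_BLOCKING actually reached the process and which Shapely and Torchpack versions are in use.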