mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0
2.35k stars 423 forks

Output interrupt #363

Closed hasaikeyQAQ closed 1 year ago

hasaikeyQAQ commented 1 year ago

Training stalls when I run the command given in readme.md:

torchpack dist-run -np 2 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth

The output appears to be stuck. [image]

I have tried changing the Shapely library version to 1.8.0 and setting the environment variable CUDA_LAUNCH_BLOCKING=1 to force CUDA synchronization, but neither has helped. (My current Shapely version is 1.8.5.)

I think this issue may be related to the following factors: my version of Torchpack is 0.3.1, which may have an incompatibility issue. In addition, I am using the Slurm job management system, which may cause the process to be killed.

Do you have any other suggestions or solutions? If you need more information, please let me know. Thank you for your help!
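For a stall like the one described above, one generic way to see where a Python process is blocked is the standard-library faulthandler module, which can periodically dump every thread's traceback. The following is only a sketch and is not part of this repository: the helper names enable_hang_dump and disable_hang_dump and the 300-second default interval are hypothetical, and it assumes the helper is called near the start of the training script.

```python
import faulthandler
import sys


def enable_hang_dump(interval_s: float = 300.0, stream=None) -> None:
    """Dump the traceback of every thread to `stream` every `interval_s` seconds.

    Hypothetical helper: `stream` must be a real file object (it needs a
    file descriptor); it defaults to the process's stderr.
    """
    target = sys.stderr if stream is None else stream
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=target)


def disable_hang_dump() -> None:
    """Cancel the periodic dump, for example once training is progressing."""
    faulthandler.cancel_dump_traceback_later()
```

If the process is stuck, the repeated dumps show which call each rank is waiting in, which may help separate a data-loading hang from a communication hang.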

Originally posted by @hasaikeyQAQ in https://github.com/mit-han-lab/bevfusion/issues/339#issuecomment-1488120007

kentang-mit commented 1 year ago

I've responded in issue 339. I'm sorry, but we have not experimented with the Slurm job management system; we use MPI instead. Therefore, it could be hard for me to help you debug this case.
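Since the difference here is Slurm versus MPI, a first check could be to log which launcher's environment variables each process actually sees. The sketch below is an illustration only: describe_launcher is a hypothetical helper, the variable names are the ones set by Slurm and Open MPI, and other MPI implementations use different names.

```python
import os

# Variables that Slurm sets for tasks it launches.
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_PROCID", "SLURM_NTASKS")
# Variables that Open MPI's mpirun sets (assumption: Open MPI is the MPI in use).
OPENMPI_VARS = ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE")


def describe_launcher(env=None):
    """Return the launcher-related variables found in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return {
        "slurm": {k: env[k] for k in SLURM_VARS if k in env},
        "openmpi": {k: env[k] for k in OPENMPI_VARS if k in env},
    }
```

Printing describe_launcher() at the top of the training script on each rank shows whether the processes were started under a Slurm allocation, by mpirun, or both.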

hasaikeyQAQ commented 1 year ago

Dear Haotian,

I have received your reply and thank you for your help.

Best regards

kentang-mit commented 1 year ago

Sounds good! I'm closing this issue since it is resolved.