Something wrong with MPI_Init

YaqinLong commented 1 year ago

Dear author: I want to deploy BEVFusion on a Jetson Xavier NX, but the project's dependencies cannot be satisfied there. For example, the Jetson Xavier NX has CUDA 11.4 and can only install PyTorch >= 1.11.0 (Python 3.8). After running the code, I got this error: "It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer)"

kentang-mit commented 1 year ago

If you want to deploy the model in an environment with only one GPU, I would highly recommend slightly refactoring the codebase to remove the dependency on torchpack. torchpack is used in this codebase mainly for multi-GPU training, not for single-GPU inference. Without the torchpack dependency, there should not be any OpenMPI-related problems.
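
For readers wondering what such a refactor might look like, here is a minimal sketch rather than the maintainers' own method. It assumes the inference script only uses torchpack for helpers such as `init()`, `rank()`, `local_rank()`, `size()`, `is_master()` and `barrier()`. Check those names against your installed torchpack version. The module name `single_gpu_dist` is hypothetical.

```python
"""single_gpu_dist.py (hypothetical name): single-process stand-ins for
distributed helpers, so that MPI is never initialised.

Assumption: the function names mirror the helpers the inference script
calls on its distributed module; adjust them to match your code.
"""


def init() -> None:
    """No-op: with one process there is nothing to initialise."""


def size() -> int:
    """Total number of processes; always 1 here."""
    return 1


def rank() -> int:
    """Global rank of this process; always 0 here."""
    return 0


def local_size() -> int:
    """Number of processes on this machine; always 1 here."""
    return 1


def local_rank() -> int:
    """Rank on this machine, i.e. the GPU index to use; always 0 here."""
    return 0


def is_master() -> bool:
    """The only process is also the master process."""
    return rank() == 0


def barrier() -> None:
    """No-op: there are no other processes to wait for."""
```

With a shim like this, the script's import of the torchpack distributed module could be swapped for `import single_gpu_dist as dist`, leaving the existing `dist.*` calls unchanged. Any other torchpack imports in the script, and any model wrapping intended for multi-GPU runs, would need to be replaced separately.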

kentang-mit commented 1 year ago

Closed due to inactivity. Please feel free to reopen if you feel it is necessary.

mit-han-lab / bevfusion

Something wrong with MPI_Init #336