tusen-ai / SST

Code for a series of work in LiDAR perception, including SST (CVPR 22), FSD (NeurIPS 22), FSD++ (TPAMI 23), FSDv2, and CTRL (ICCV 23, oral).
Apache License 2.0

Multi-GPU training stops by itself in less than one epoch and prints many messages about protobuf #108

Closed ZecCheng closed 1 year ago

ZecCheng commented 1 year ago

I trained FSD on two RTX 3090 GPUs, and the program stopped running before finishing one epoch. Can you help me solve this problem? Thanks for your time!

(SST) lcx@Lab504:/opt/lcx/Project/SST$ sh run.sh
/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2023-05-10 10:45:46.857578: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-05-10 10:45:46.857577: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2285450) of binary: /opt/lcx/anaconda3/envs/SST/bin/python3
Traceback (most recent call last):
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/lcx/anaconda3/envs/SST/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
[1]:
  time : 2023-05-10_10:46:16
  host : Lab504
  rank : 1 (local_rank: 1)
  exitcode : -11 (pid: 2285451)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2285451
Root Cause (first observed failure):
[0]:
  time : 2023-05-10_10:46:16
  host : Lab504
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 2285450)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2285450

My environment is listed below:

2023-05-10 10:42:21,394 - mmdet - INFO - Environment info:
sys.platform: linux
Python: 3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3090
CUDA_HOME: /home/lcx/cuda11
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (GCC) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:

TorchVision: 0.11.0
OpenCV: 4.7.0
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.19.1
MMSegmentation: 0.14.1
MMDetection3D: 0.15.0+431d011
spconv2.0: True

ZecCheng commented 1 year ago

I tried to train with one GPU and got this error.

[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.20.3). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".) Aborted (core dumped)

ZecCheng commented 1 year ago

Solved by running pip install protobuf==3.9.2, which matches the version named in the error message.
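
For anyone hitting the same message, here is a minimal Python sketch of how the two versions can be read out of the libprotobuf error text to arrive at the pin used above. The helper names `parse_protobuf_mismatch` and `suggest_pin` are made up for illustration and are not part of SST or protobuf; the regular expression only assumes the wording of the message quoted in this thread.

```python
import re

# Matches the two version numbers in libprotobuf's version-check failure message.
# The wording is taken from the error quoted in this thread.
_MISMATCH = re.compile(
    r"compiled against version (\d+(?:\.\d+)*) of the Protocol Buffer runtime library, "
    r"which is not compatible with the installed version \((\d+(?:\.\d+)*)\)"
)


def parse_protobuf_mismatch(message):
    """Return (compiled_version, installed_version), or None if the text does not match."""
    match = _MISMATCH.search(message)
    return match.groups() if match else None


def suggest_pin(message):
    """Build a pip command that pins protobuf to the compiled-against version (hypothetical helper)."""
    parsed = parse_protobuf_mismatch(message)
    if parsed is None:
        return None
    compiled, installed = parsed
    if compiled == installed:
        return None  # nothing to fix
    return f"pip install protobuf=={compiled}"


error_text = (
    "[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled "
    "against version 3.9.2 of the Protocol Buffer runtime library, which is not "
    "compatible with the installed version (3.20.3)."
)
print(suggest_pin(error_text))  # prints: pip install protobuf==3.9.2
```

Pinning to the compiled-against version is what worked here; whether other packages in the environment tolerate that older protobuf is something to check case by case.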

Kampffussel03 commented 4 months ago

@ZecCheng can you share your installation process, please?

I tried to set up SST using your specs, only with an RTX 4090, but it seems that TORCH_CUDA_ARCH_LIST="8.9" is required, which is not compatible with MMDetection3D 0.15.0. I followed various instructions step by step, but in the end pip install -v -e . fails while building the wheel for mmdet3d. @Abyssaledge
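
As a rough aid for checking which value TORCH_CUDA_ARCH_LIST would need on a given machine, the sketch below asks PyTorch for each GPU's compute capability and formats the result. The function names are hypothetical, and this only reports the value; it does not make an older toolchain support it. As far as I know, compute capability 8.9 needs a newer CUDA toolkit than the 11.1 shown in the environment above, so the build failure may come from nvcc rather than from MMDetection3D itself.

```python
def arch_list_from_capabilities(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates from identical GPUs
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Query visible GPUs through PyTorch; returns [] if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# Known examples: an RTX 3090 reports (8, 6) and an RTX 4090 reports (8, 9).
print(arch_list_from_capabilities([(8, 6), (8, 6)]))  # two 3090s
print(arch_list_from_capabilities([(8, 9)]))          # one 4090
print("detected:", arch_list_from_capabilities(detect_capabilities()))
```

The printed value can then be exported before the build, for example TORCH_CUDA_ARCH_LIST="8.9" pip install -v -e ., provided the installed nvcc accepts that architecture.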