mit-han-lab / spvnas

[ECCV 2020] Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
http://spvnas.mit.edu/
MIT License

dist.init() hang/stuck #85

Closed by l9761116 2 years ago

l9761116 commented 2 years ago

When I run the command "torchpack dist-run -np 1 python evaluate.py configs/semantic_kitti/default.yaml --name SemanticKITTI_val_SPVNAS@65GMACs", it gets stuck at "dist.init()" and prints no output. Then I tried "python evaluate.py configs/semantic_kitti/default.yaml --name SemanticKITTI_val_SPVNAS@65GMACs --distributed False", but it reports:

'''
  File "evaluate.py", line 24, in main
    dist.init()
  File "/home/anaconda3/envs/torch/lib/python3.7/site-packages/torchpack/distributed/context.py", line 23, in init
    master_host = 'tcp://' + os.environ['MASTER_HOST']
  File "/home/anaconda3/envs/torch/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'MASTER_HOST'
'''

Has anyone else run into this problem? Is it an environment problem? Some of my packages are as follows:

'''
cudatoolkit  11.3.1  h2bc3f7f_2
mpi          1.0     openmpi                conda-forge
mpi4py       3.1.3   pypi_0                 pypi
pytorch      1.7.0   py3.7_cpu_0 [cpuonly]  pytorch
torchpack    0.3.1   pypi_0                 pypi
torchsparse  1.4.0   pypi_0                 pypi
torchvision  0.8.1   py37_cpu [cpuonly]     pytorch
tqdm         4.63.0  pypi_0                 pypi
'''

I think the problem may be related to MPI (mpi4py or something), but I don't know much about it. Does anyone know the solution?
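Note on the KeyError: the traceback shows torchpack 0.3.1 reading MASTER_HOST from the environment inside dist.init(). That variable is presumably exported by the torchpack dist-run launcher, so it is missing when the script is started with plain python. A minimal sketch of a pre-check follows; the helper name missing_launcher_vars and the REQUIRED_VARS tuple are hypothetical, and only MASTER_HOST is confirmed by the traceback.

```python
import os

# Only MASTER_HOST is confirmed by the traceback above; extend this tuple
# if your torchpack version turns out to read other variables.
REQUIRED_VARS = ("MASTER_HOST",)


def missing_launcher_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are absent from `env`."""
    env = os.environ if env is None else env
    return [name for name in required if name not in env]


if __name__ == "__main__":
    missing = missing_launcher_vars()
    if missing:
        print("Not started by a launcher? Missing:", ", ".join(missing))
    else:
        print("Launcher variables are present.")
```

Running this in the same shell as evaluate.py shows whether the KeyError is expected before dist.init() is ever called.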

zhijian-liu commented 2 years ago

Please install the latest TorchPack:

pip install --upgrade git+https://github.com/zhijian-liu/torchpack.git

This should allow you to run the evaluation without torchpack dist-run. Btw, I noticed that you installed the CPU-version PyTorch. Could you try installing the GPU version instead?
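To confirm which PyTorch build is active in the environment, a small check along these lines may help. It is only a sketch: describe_torch_build is a hypothetical helper name, and it relies on torch.version.cuda being None for CPU-only builds.

```python
import importlib.util


def describe_torch_build():
    """Summarize the installed PyTorch build, or return None if PyTorch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "version": torch.__version__,
        # None on CPU-only builds, otherwise the CUDA version PyTorch was built with.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

A cuda_build of None would match the [cpuonly] entries in the package list above.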

l9761116 commented 2 years ago

Yes, thanks for the reminder. I reinstalled the GPU version and upgraded torchpack, but now there is a new problem:

'''
import torchsparse.backend
ImportError: /home/anaconda3/envs/torch/lib/python3.7/site-packages/torchsparse/backend.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[34422,1],0]
Exit code: 1
'''

I have no idea what causes it.

zhijian-liu commented 2 years ago

Thanks for the update! Could you please also reinstall TorchSparse?
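Some background on why a reinstall is a plausible fix: the missing symbol _ZNK2at6Tensor6deviceEv demangles to at::Tensor::device() const, a function provided by the PyTorch libraries. An undefined symbol of this kind usually suggests that the compiled extension was built against a different PyTorch than the one now installed, although the thread does not confirm that this was the cause here. The sketch below (check_extension is a hypothetical helper, not part of TorchSparse) tries the import and flags that case.

```python
import importlib


def check_extension(module_name):
    """Try to import `module_name`; return (ok, message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        message = str(exc)
        if "undefined symbol" in message:
            # Assumption: the binary no longer matches the installed PyTorch.
            message += " (likely built against a different PyTorch; rebuild it)"
        return False, message
    return True, "ok"


if __name__ == "__main__":
    print(check_extension("torchsparse.backend"))
```

If the check still fails after reinstalling, it is worth making sure the extension was actually rebuilt against the current PyTorch and not restored from a cached build.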