kylevedder opened 3 years ago
I was able to reproduce this error on a friend's system using my Dockerfile, making me think this is not an issue with my base machine.
I tried compiling Open3D from source inside the docker container. It compiled if I skipped the ML library (which then made the training pipeline fail on a failed import), but with the ML library enabled none of the targets built successfully. I am going to reimplement the `iou_bev` function, as that seems to be the only function blocking training.
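For anyone attempting the same, a minimal NumPy sketch of what such a reimplementation could look like is below. This is a hypothetical stand-in, not the author's code: the real `iou_bev` handles rotated boxes (which requires polygon clipping), while this sketch covers only the axis-aligned case with boxes given as `(x1, y1, x2, y2)`.

```python
import numpy as np

def iou_bev_axis_aligned(boxes_a, boxes_b):
    """Pairwise IoU for axis-aligned BEV boxes given as (x1, y1, x2, y2).

    Hypothetical sketch: the real iou_bev supports rotated boxes; this
    version only handles the axis-aligned case.
    """
    boxes_a = np.asarray(boxes_a, dtype=np.float64)
    boxes_b = np.asarray(boxes_b, dtype=np.float64)
    # Intersection rectangle for every (a, b) pair via broadcasting.
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # upper-left corners
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # lower-right corners
    wh = np.clip(rb - lt, 0.0, None)  # clamp to zero when boxes do not overlap
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)  # guard against zero-area unions
```

Being pure NumPy, it runs entirely in Python-land and sidesteps the custom-op hang, at the cost of rotated-box support.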
As per @sanskar107's suggestion, adding `--pipeline.num_workers 0` serves as a viable workaround.
As suggested, the root cause seems to be any custom op in the PyTorch dataloader, not just Open3D's `iou_bev` function: my own `iou_bev`, a Python wrapper around a C++ implementation, causes the same hang.
@kylevedder Thanks for reporting this. The main problem seems to be related to this issue: https://github.com/pytorch/pytorch/issues/46409. Could you try the workarounds mentioned there and create a pull request if anything works?
Setting `OMP_NUM_THREADS=1` and running with the default number of workers fixes this issue, with significantly higher throughput than setting `--pipeline.num_workers 0`. I will investigate this thread more later.
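One detail worth noting with this workaround: the environment variable generally only takes effect if it is set before the first library that initializes an OpenMP runtime is imported. A minimal sketch of the safe ordering (the downstream imports are shown as comments and are assumptions, not part of the original report):

```python
import os

# OMP_NUM_THREADS must be set before any OpenMP-backed library loads,
# so this belongs at the very top of the entry script (or in the shell
# environment before launching Python at all).
os.environ["OMP_NUM_THREADS"] = "1"

# import torch                      # hypothetical downstream imports,
# import open3d.ml.torch as ml3d    # shown only to illustrate ordering

print(os.environ["OMP_NUM_THREADS"])
```

Setting it in the shell (`OMP_NUM_THREADS=1 python train.py`) avoids the ordering question entirely.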
The `iou_bev` invocation inside `datasets/utils/operations.py`'s `box_collision_test()` hangs on invocation with no CPU or memory usage, blocking all forward progress.

This can be reproduced using the following Dockerfile, which codifies the instructions provided in the README, and then the following command run inside the container with this repo mounted at `/Open3D-ML`:

which produces

and then makes no forward progress. The installed `open3d` version is `0.13.0`, as per

This was run on a machine with Driver Version 460.80 and CUDA Version 11.2, without a base CUDA install (i.e. no `nvcc`; I manage CUDA installs via `conda`):

Failed attempts to fix the issue:
1) `conda` and `pip` versions of `open3d`
2) `isl-org`-compiled and `conda` versions of PyTorch `1.7.1`
3) Base images with CUDA 10.1 (`nvidia/cuda:10.1-cudnn8-devel-ubuntu18.04`) and 11.1 (`nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04`)
4) Default instructions inside of a standard `conda` environment (not inside a docker container)
5) Python 3.6 and 3.8

A trivial standalone program that invokes this IoU function inside the given docker container operates correctly. The following script correctly prints
Additionally, I have tried a trivial train pipeline inside another folder in order to avoid a possible namespace issue, but this did not fix the issue. I have noticed that when `OPEN3D_ML_ROOT` is not set, it uses another install of `Open3D-ML` provided as part of `open3d`, but the issue still persists.
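When two installs can shadow each other like this, checking a module's `__file__` is a quick way to confirm which checkout actually got imported. A small sketch (demonstrated on a stdlib module so it runs anywhere; the Open3D-ML package name to pass is an assumption, it imports as `ml3d` or via `open3d.ml` depending on whether `OPEN3D_ML_ROOT` is set):

```python
import importlib

def locate(module_name):
    """Return the file path a module resolves to, to confirm which install is imported."""
    mod = importlib.import_module(module_name)
    return getattr(mod, "__file__", None)

# Demonstrated with a stdlib module; for the actual check one would pass
# the Open3D-ML package name (hypothetical: "ml3d" or "open3d.ml").
print(locate("json"))
```

If the printed path points into `site-packages` rather than the mounted `/Open3D-ML` checkout, the bundled copy is the one being used.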