traveller59 / second.pytorch

SECOND for KITTI/NuScenes object detection
MIT License

train.py core dumped in scn_input() #30

Closed: teddybuy closed this issue 5 years ago

teddybuy commented 5 years ago

python3 ./second/pytorch/train.py train --config_path=./second/configs/car.config --model_dir=/media/1t/data/kitti/second_model
Segmentation fault (core dumped)

It seems to core dump at voxelnet.py, line 278:

ret = self.scn_input((coors.cpu(), voxel_features, batch_size))
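
One way to confirm which Python line triggers the crash is the standard library's faulthandler module, which prints the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, assuming it is placed near the top of train.py; the helper name enable_crash_traceback and the optional log path are hypothetical and not part of second.pytorch.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Print the Python traceback of all threads on SIGSEGV and other fatal signals.

    log_path is a hypothetical file name; stderr is used when it is omitted.
    The returned stream must stay open for as long as the handler is enabled.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream


if __name__ == "__main__":
    # Enable the handler before any native extension code runs.
    crash_stream = enable_crash_traceback()
```

The same handler can be switched on without editing any file by running the script with python3 -X faulthandler. The traceback only shows the Python frame that called into native code, so it narrows the crash down to a call such as self.scn_input but does not explain what went wrong inside the extension.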

Here is the Dockerfile I used to generate the Docker image. (I copied extension.h from PyTorch 1.0.)

#docker build -f Dockerfile-python35-pytorch41  -t vacuum/pytorch:python35-pytorch41-simple-v1
FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04
#FROM nvidia/cuda:9.1-base-ubuntu16.04
RUN apt update -y
RUN apt-get install software-properties-common python-software-properties -y
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update -y && apt install -y \
    python3.6 \
    python3-pip \
    python3-tk \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libfontconfig1 \
    libxrender1 \
    vim \
    less \
    git 

RUN python3.6 -m pip install torch torchvision opencv-python
RUN python3.6 -m pip install shapely fire pybind11 pyqtgraph tensorboardX protobuf numba
RUN apt-get install libboost-all-dev -y
RUN apt-get install -y cuda-nvprof-9-1
RUN apt-get install -y libsparsehash-dev
RUN apt-get install -y python3.6-dev
RUN python3.6 -m pip install pillow
RUN rm -fr /usr/bin/python
RUN rm -fr /usr/bin/python3
RUN ln -s /usr/bin/python3.6 /usr/bin/python
RUN ln -s /usr/bin/python3.6 /usr/bin/python3
RUN git clone https://github.com/facebookresearch/SparseConvNet.git
COPY extension.h /usr/local/lib/python3.6/dist-packages/torch/lib/include/torch/extension.h
RUN cd SparseConvNet && bash build.sh && cd ..
ENV NUMBAPRO_CUDA_DRIVER /usr/lib/x86_64-linux-gnu/libcuda.so
ENV NUMBAPRO_NVVM /usr/local/cuda/nvvm/lib64/libnvvm.so
ENV NUMBAPRO_LIBDEVICE /usr/local/cuda/nvvm/libdevice
RUN apt install -y gdb psmisc

traveller59 commented 5 years ago

I can't build SparseConvNet master with Ubuntu 18.04, gcc-6, CUDA 9.0, and pytorch-nightly-1.0dev20181106. Consider using torch 0.4.1 and SparseConvNet commit edf89af339ee929d9416f3509ff405450949f606.
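
A quick way to check that an environment really has the suggested torch release is to compare the installed version string against the pinned one. This is a minimal sketch, not part of second.pytorch: the helper names version_matches and report are hypothetical, and the pinned value 0.4.1 is taken from the comment above.

```python
def version_matches(installed, pinned):
    """Return True when the installed version starts with the pinned release numbers."""
    # Drop local build suffixes such as "+cu90" before comparing.
    installed_parts = installed.split("+")[0].split(".")
    pinned_parts = pinned.split(".")
    return installed_parts[:len(pinned_parts)] == pinned_parts


def report(installed, pinned="0.4.1"):
    """Build a one-line message describing whether the versions agree."""
    status = "OK" if version_matches(installed, pinned) else "MISMATCH"
    return "torch {} (pinned {}): {}".format(installed, pinned, status)


# Usage inside the container, assuming torch is installed there:
#     import torch
#     print(report(torch.__version__))
```

Comparing the dotted components rather than raw string prefixes keeps a version such as 0.4.10 from being accepted as 0.4.1.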

teddybuy commented 5 years ago

Thanks for the quick response!

I tried sparseconvnet edf89af339ee929d9416f3509ff405450949f606, but it still core dumps.

I checked my Dockerfile and the scripts that build and run the Docker image into my fork. Maybe you can build the image and try it. Thanks! https://github.com/teddybuy/second.pytorch/tree/master/docker

docker host: nvidia driver version 396.44

in container (python 3.6.7, pytorch 0.4.1, gcc 5.4.0, cuda 9.1.85, cudnn 7.1, ubuntu 16.04.5):

aa7786ad552f:~/code/second.pytorch$ python
Python 3.6.7 (default, Oct 21 2018, 04:56:05) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
0.4.1

aa7786ad552f:~/code/second.pytorch$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

aa7786ad552f:~/code/second.pytorch$ cat /etc/issue
Ubuntu 16.04.5 LTS \n \l

aa7786ad552f:~/code/second.pytorch$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

aa7786ad552f:~/code/second.pytorch$ cat /usr/include/x86_64-linux-gnu/cudnn_v*.h | grep CUDNN_MAJOR -A 2  
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 2

teddybuy commented 5 years ago

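I found the problem. SparseConvNet cannot be built and installed from the Dockerfile during docker build. I have to start a container with nvidia-docker run .. bash, build and install SparseConvNet inside it, and then save the result with docker commit.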
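
The workaround can be sketched as two commands: one that runs the SparseConvNet build inside a container started with nvidia-docker, and one that commits that container as a new image. The sketch below only assembles and prints the commands. It assumes the base image was built from the Dockerfile above without the SparseConvNet build step, and that the SparseConvNet checkout sits in the container's default working directory. The image and container names (second-base, second-build, second-built) and the helper workaround_commands are hypothetical.

```python
import shlex


def workaround_commands(base_image, container_name, final_image,
                        build_cmd="cd SparseConvNet && bash build.sh"):
    """Return the two commands: build inside a running container, then commit it."""
    # Step 1: run the build in a container started through nvidia-docker.
    run_cmd = ["nvidia-docker", "run", "--name", container_name,
               base_image, "bash", "-c", build_cmd]
    # Step 2: save the stopped container, now containing the build, as an image.
    commit_cmd = ["docker", "commit", container_name, final_image]
    return [run_cmd, commit_cmd]


if __name__ == "__main__":
    # Hypothetical names; replace them with the real image and container names.
    for cmd in workaround_commands("second-base", "second-build", "second-built"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere. Passing each list to subprocess.run(cmd, check=True) would execute the steps on a host that has nvidia-docker installed.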