lzhengning / SubdivNet

Subdivision-based Mesh Convolutional Networks.
MIT License

Cannot train an epoch #43

Open · Licolas opened 2 months ago

Licolas commented 2 months ago

Training always stops here:

(subdivnet) lm@lm:~/0-majorRevision/SubdivNet-master$ sh scripts/manifold40/train.sh
[i 0711 09:55:59.289227 96 compiler.py:956] Jittor(1.3.8.5) src: /home/lm/anaconda3/envs/subdivnet/lib/python3.7/site-packages/jittor
[i 0711 09:55:59.298060 96 compiler.py:957] g++ at /usr/bin/g++(11.4.0)
[i 0711 09:55:59.298134 96 compiler.py:958] cache_path: /home/lm/.cache/jittor/jt1.3.8/g++11.4.0/py3.7.16/Linux-6.5.0-41xc8/IntelRXeonRSilxdc/default
[i 0711 09:55:59.307678 96 init.py:411] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 0711 09:55:59.383568 96 init.py:411] Found gdb(22.04.2) at /usr/bin/gdb.
[i 0711 09:55:59.397309 96 init.py:411] Found addr2line(2.38) at /usr/bin/addr2line.
[i 0711 09:55:59.510948 96 compiler.py:1011] cuda key:cu11.7.99_sm_89
[i 0711 09:56:00.005350 96 init.py:227] Total mem: 62.44GB, using 16 procs for compiling.
Compiling jittor_core(151/151) used: 2.437s eta: 0.000s
[i 0711 09:56:02.815749 96 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0711 09:56:02.888116 96 init.cc:62] Found cuda archs: [89,]
[w 0711 09:56:02.903832 96 compiler.py:1384] CUDA arch(89)>86 will be backward-compatible
[w 0711 09:56:02.935237 96 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH(['', '/usr/local/cuda-11.7/lib64', '/home/lm/anaconda3/envs/subdivnet/bin', '/home/lm/anaconda3/condabin', '/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin', '/usr/games', '/usr/local/games', '/snap/bin', '/snap/bin', '/usr/local/cuda-11.7/bin']), This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path. Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[i 0711 09:56:12.951927 96 cuda_flags.cc:49] CUDA enabled.
name: manifold40
Train 0: 0%|▍ | 12/3278 [00:06<20:55, 2.60it/s][w 0711 09:56:20.710701 96 cudnn_convTx_float32__Ty_float32Tw_float32XFORMAT_abcd__WFORMAT_oihwYFORMAT_abcd_hash_4d5b3e2d24c769d3op.cc:419] forward algorithm cache is full
Train 0: 0%|▍ | 13/3278 [00:06<21:05, 2.58it/s][w 0711 09:56:20.865463 96 cudnn_conv_backward_wTx_float32Ty_float32Tw_float32__XFORMAT_abcdWFORMAT_oihwYFO___hash_8e480e8564e59906_op.cc:418] backward w algorithm cache is full
Train 0: 0%|▍ | 15/3278 [00:07<19:45, 2.75it/s][w 0711 09:56:21.510013 96 cudnn_conv_backward_xTx_float32Ty_float32__Tw_float32XFORMAT_abcdWFORMAT_oihw_YFOhash_af8994a8aef53c1c_op.cc:410] backward x algorithm cache is full
Train 0: 67%|████████████████████████████████████████████████████████████████████▌ | 2184/3278 [10:19<05:21, 3.40it/s]
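One thing worth noting: the compile_extern.py:203 warning above flags the CUDA entries in LD_LIBRARY_PATH/PATH as a possible source of wrong libraries, and it proposes two fixes. A minimal sketch of both, assuming the cuda-11.7 paths and python3.7 from my log (adjust to your environment):

# Option 1: strip the CUDA entries before launching training, as the warning asks
unset LD_LIBRARY_PATH
export PATH=$(echo "$PATH" | tr ':' '\n' | grep -v 'cuda-11.7' | paste -sd: -)
sh scripts/manifold40/train.sh

# Option 2: let Jittor install and manage its own CUDA toolkit
# (command taken from the warning message itself)
python3.7 -m jittor_utils.install_cuda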

The error log is as follows:


Async error was detected. To locate the async backtrace and get a better error report, please rerun your code with two environment variables set:

export JT_SYNC=1
export trace_py_var=3
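Rerunning with the two variables the error message asks for should turn the async failure into a synchronous one with a usable backtrace. A sketch, reusing the same train script as above:

# force synchronous execution and Python-level tracing, then rerun training
export JT_SYNC=1
export trace_py_var=3
sh scripts/manifold40/train.sh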