Open: tchaton opened this issue 5 years ago
Really sorry for the delay, I somehow missed this. I think this could have the same cause as issue https://github.com/mys007/ecc/issues/1; it just surfaces in a different place because CUDA_LAUNCH_BLOCKING was not set. Unfortunately, that issue has not been solved, and a big rewrite might be the only way to fix it. Or use pytorch_geometric instead :P
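For context on the CUDA_LAUNCH_BLOCKING remark: CUDA kernels run asynchronously, so an illegal memory access is often reported by a later call that merely synchronizes (here `torch.cumsum` inside `conv_aggregate_fw`), not by the kernel that actually faulted. The snippet below is a minimal sketch, assuming the variable is set at the very top of the training script before torch touches the GPU; exporting `CUDA_LAUNCH_BLOCKING=1` in the shell before launching `learning/main.py` has the same effect. Expect noticeably slower training while it is enabled, so use it only for debugging.

```python
import os

# CUDA reads this variable when it creates its context, so set it before
# `import torch` (or at least before the first CUDA call). With "1", each
# kernel launch blocks until the kernel finishes, so an illegal memory access
# is reported at the call that launched the faulting kernel instead of at a
# later synchronization point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set, then train as usual
```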
Config:
Python 3.6.4, torch 1.2.0
This error is triggered frequently and stops the training.
```
Traceback (most recent call last):
  File "learning/main.py", line 607, in <module>
    main()
  File "learning/main.py", line 455, in main
    trainmetrics, = train()
  File "learning/main.py", line 296, in train
    outputs = model.ecc(embeddings[0], clouds_data[4:6])
  File "/home/thomas/.pyenv/versions/spg3.6.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/graphnet.py", line 145, in forward
    input = module(input)
  File "/home/thomas/.pyenv/versions/spg3.6.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/modules.py", line 88, in forward
    input = ecc.GraphConvFunction(nc, nc, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(hx, weights)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/ecc/GraphConvModule.py", line 67, in forward
    cuda_kernels.conv_aggregate_fw(output.narrow(0,startd,numd), products.view(-1,self._out_channels), self._degs_gpu.narrow(0,startd,numd))
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/ecc/cuda_kernels.py", line 123, in conv_aggregate_fw
    csdegs = torch.cumsum(degs,0)
RuntimeError: scan failed to synchronize: an illegal memory access was encountered
```
```
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneT
```