mys007 / ecc

Edge-Conditioned Convolutions on Graphs

Frequent Error #7

Open tchaton opened 5 years ago

tchaton commented 5 years ago

Config:

Python 3.6.4, torch 1.2.0

This error is triggered frequently and stops the training.

Traceback (most recent call last):
  File "learning/main.py", line 607, in <module>
    main()
  File "learning/main.py", line 455, in main
    trainmetrics, = train()
  File "learning/main.py", line 296, in train
    outputs = model.ecc(embeddings[0], clouds_data[4:6])
  File "/home/thomas/.pyenv/versions/spg3.6.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/graphnet.py", line 145, in forward
    input = module(input)
  File "/home/thomas/.pyenv/versions/spg3.6.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/modules.py", line 88, in forward
    input = ecc.GraphConvFunction(nc, nc, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(hx, weights)
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/ecc/GraphConvModule.py", line 67, in forward
    cuda_kernels.conv_aggregate_fw(output.narrow(0,startd,numd), products.view(-1,self._out_channels), self._degs_gpu.narrow(0,startd,numd))
  File "/home/thomas/HELIX/superpoint-graph-job/superpointgraph2/learning/../learning/ecc/cuda_kernels.py", line 123, in conv_aggregate_fw
    csdegs = torch.cumsum(degs,0)
RuntimeError: scan failed to synchronize: an illegal memory access was encountered

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneT

mys007 commented 5 years ago

Really sorry for the delay, I somehow missed this. I think this could have the same cause as Issue https://github.com/mys007/ecc/issues/1 ; it just shows up in a different place because CUDA_LAUNCH_BLOCKING is not set, so the error is reported at a later call than the kernel that actually faulted. Unfortunately, the issue has not been solved, and a big rewrite might be the only way to fix it. Or use pytorch_geometric instead :P.
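For anyone trying to narrow this down: a minimal sketch of turning on synchronous kernel launches, so that the traceback points at the kernel that actually faulted rather than at a later call such as `torch.cumsum`. `CUDA_LAUNCH_BLOCKING` is the standard CUDA environment variable; the helper name `enable_blocking_launches` is made up for illustration and is not part of this repository. This assumes the variable has to be set before CUDA is initialized, so it goes at the very top of the entry script, before `torch` or `cupy` is imported.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given mapping (default: os.environ).

    Hypothetical helper for debugging only: with blocking launches, each CUDA
    kernel finishes before the next Python line runs, so an illegal memory
    access is reported at the launch that caused it. Training gets slower.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this first, before anything initializes CUDA.
enable_blocking_launches()

# Only import the GPU libraries afterwards, e.g.:
# import torch
```

The same thing can be done without touching the code by exporting the variable in the shell that starts training. Either way it only helps locate the fault; it does not fix it.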