CUDA error: an illegal memory access

rusty1s / pytorch_spline_conv

Implementation of the Spline-Based Convolution Operator of SplineCNN in PyTorch

https://arxiv.org/abs/1711.08920

MIT License


CUDA error: an illegal memory access #41

Closed: kvndhrty closed this issue 8 months ago

kvndhrty commented 1 year ago

I'm running PyTorch 2.0.1 (tried with both CUDA 11.7 and 11.8) with PyTorch Lightning. A fairly simple model fails during backprop and throws:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The same model also fails on the CPU, but without raising an error: the process just exits quietly at the first backward pass.

I'm on a Windows x86 machine.
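
For anyone hitting the same message, here is a minimal sketch of acting on the hint in the error text. It assumes the environment variable is set before the first CUDA call in the process (setting it after CUDA has been initialized is not expected to have an effect). The helper name `enable_crash_debugging` is hypothetical; `faulthandler` is the standard-library module that prints the Python traceback on a hard crash, which may help locate a silent exit like the CPU case described above.

```python
import faulthandler
import os


def enable_crash_debugging() -> None:
    """Hypothetical helper: call this at the very top of the training script."""
    # Make CUDA kernel launches synchronous so the reported stack trace
    # points at the call that actually failed.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    # Dump the Python traceback if the interpreter dies from a fatal error
    # such as a segmentation fault.
    faulthandler.enable()


enable_crash_debugging()
# import torch and build the model only after this point
```

Synchronous launches slow training down considerably, so this is meant for a short debugging run, not for regular use.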

kvndhrty commented 1 year ago

I should note that swapping every SplineConv layer out for a GCNConv layer eliminates the error, so it somehow traces back to SplineConv, but I'm not sure how.

rusty1s commented 1 year ago

Do you have a reproducible example for me?

kvndhrty commented 1 year ago

I'll put together a barebones example today or tomorrow. I'm hoping that running this on a Linux machine resolves my issue in the short term.

kvndhrty commented 1 year ago

Fixed. Some of the values in my edge_attr were small negative numbers. It was hard to tell from the segfault / illegal memory access errors where the problem came from.
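
The fix described above can be turned into a guard that fails with a readable message before the tensor reaches the layer. This is only a sketch: it assumes, as I understand the torch_spline_conv README, that the pseudo-coordinates passed as `edge_attr` are expected to lie in the interval [0, 1]. The helper name `check_pseudo` is hypothetical and not part of any library.

```python
import torch


def check_pseudo(edge_attr: torch.Tensor, name: str = "edge_attr") -> torch.Tensor:
    """Hypothetical guard: raise a readable error for out-of-range pseudo-coordinates."""
    if edge_attr.numel() == 0:
        return edge_attr
    if not torch.isfinite(edge_attr).all():
        raise ValueError(f"{name} contains NaN or inf values")
    lo = edge_attr.min().item()
    hi = edge_attr.max().item()
    # Assumption: the operator expects values in the closed interval [0, 1].
    if lo < 0.0 or hi > 1.0:
        raise ValueError(f"{name} must lie in [0, 1], got range [{lo:.6g}, {hi:.6g}]")
    return edge_attr


# A tiny negative value, like the one that caused this issue, is caught early.
edge_attr = torch.tensor([[0.25, 0.5], [-1e-7, 1.0]])
try:
    check_pseudo(edge_attr)
    caught = False
except ValueError as err:
    caught = True
    print(err)

# If the negatives are only rounding noise, clamping is one possible repair.
fixed = check_pseudo(edge_attr.clamp(0.0, 1.0))
```

Clamping is only reasonable when the out-of-range values are numerical noise; if they are far outside the interval, the normalization that produced them is the thing to fix.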

github-actions[bot] commented 8 months ago

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

kvndhrty commented 8 months ago

Closing this