Illegal Memory Access Error When Running Training on GPU Cluster

yaqlee commented 1 year ago

I wrote a feature builder and a target builder, and created a model class that inherits from TorchModuleWrapper. I also wrote a new objective class for my model. These components run without errors on my local machine, but running run_training on a cloud GPU cluster fails with the following error:

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fb5426d9d62 in /root/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fb5d0f2024a in /root/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fb5d0f22540 in /root/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11c (0x7fb5d0f2300c in /root/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: + 0xd6de4 (0x7fb63eedfde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7fb640d0f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb640ada133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
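The error message itself suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the stack trace points at the call that actually failed. A minimal sketch of setting it for a launched process might look like the following. The helper names debug_env and run_with_debug_env are hypothetical and not part of nuplan-devkit or PyTorch; NCCL_DEBUG=INFO is added only as an assumption that NCCL's own logging could help, since the trace comes from the NCCL process group cleanup loop.

```python
import os
import subprocess


def debug_env(base_env=None):
    """Return a copy of the environment with extra CUDA/NCCL debugging enabled.

    Hypothetical helper: the input mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Named in the error message: report CUDA kernel errors synchronously,
    # at the call that triggered them, instead of at a later API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Assumption: verbose NCCL logging may add context for multi-GPU runs.
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_with_debug_env(command):
    """Run `command` (a list of arguments) with the debug environment.

    Returns the exit code of the child process.
    """
    return subprocess.run(command, env=debug_env()).returncode
```

For example, run_with_debug_env(["python", "path/to/run_training.py"]) would launch training with both variables set; the script path here is a placeholder for however training is normally started on the cluster. Synchronous launches slow training down considerably, so this is meant for a single debugging run.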

I'm wondering what might be causing this error. Could it be because my model code doesn't follow the libtorch convention of specifying data types for each function's parameters? Any insights or suggestions on how to resolve this issue would be greatly appreciated.

Thank you!

yaqlee commented 1 year ago

By the way, when I ran training on my local machine the dataset was very small, since I used 'scenario_builder=nuplan_mini' and 'scenario_filter.limit_total_scenarios=0.001'.

patk-motional commented 1 year ago

Hi @yaqlee,

Do you have the same requirements and dependencies installed on your remote machines?

P.S. It is admittedly very difficult for us to help you with your custom code and custom cloud setup. We'll try our best to answer what we can.
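One way to answer the dependency question above is to dump package versions on both machines and compare them. The sketch below uses only the Python standard library; the function names and the package list in the usage note are illustrative assumptions, not a list of what nuplan-devkit actually requires.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(local, remote):
    """Return {name: (local_version, remote_version)} for every entry that differs."""
    names = sorted(set(local) | set(remote))
    return {
        name: (local.get(name), remote.get(name))
        for name in names
        if local.get(name) != remote.get(name)
    }
```

Running collect_versions(["torch", "pytorch-lightning", "numpy"]) on the local machine and on a cluster node, then passing both results to diff_versions, would list any mismatches. This only covers Python packages; the CUDA toolkit and driver versions would need to be checked separately, for example with torch.version.cuda and nvidia-smi.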

motional / nuplan-devkit

Illegal Memory Access Error When Running Training on GPU Cluster #281