hqucms / weaver-core

Streamlined neural network training.
MIT License

Unexpected error raised while using weaver DataLoader #7

Closed: ryanliu30 closed this issue 1 year ago

ryanliu30 commented 1 year ago

I am using the weaver DataLoader to load the JetClass dataset (via the train_load function in train.py). However, the following error was raised when I launched a second run in the same notebook after the first run had finished:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f20f854e4d7 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f20f851836b in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f20f85f2fa8 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9c37 (0x7f207a1b2c37 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccec6 (0x7f20f8aeaec6 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f20f8533e77 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f20f852c69e in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f20f852c7b9 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x752478 (0x7f20f8d70478 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f20f8d70805 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12a067 (0x5580c8b99067 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #11: <unknown function> + 0x18be85 (0x5580c8bfae85 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #12: <unknown function> + 0x120928 (0x5580c8b8f928 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #13: <unknown function> + 0x1d1b3e (0x5580c8c40b3e in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #14: _PyObject_GC_NewVar + 0x245 (0x5580c8b86445 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #15: PyTuple_New + 0x117 (0x5580c8b8bfd7 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #16: <unknown function> + 0x12b11b (0x5580c8b9a11b in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #17: <unknown function> + 0x12ae17 (0x5580c8b99e17 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #18: <unknown function> + 0x12b183 (0x5580c8b9a183 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #19: <unknown function> + 0x12adec (0x5580c8b99dec in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #20: <unknown function> + 0x12b233 (0x5580c8b9a233 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #21: <unknown function> + 0x12adec (0x5580c8b99dec in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #22: <unknown function> + 0x1d53c8 (0x5580c8c443c8 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #23: <unknown function> + 0x1e88c2 (0x5580c8c578c2 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #24: <unknown function> + 0x13cd4b (0x5580c8babd4b in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x4d1d (0x5580c8ba127d in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #28: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #30: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #32: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #34: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #36: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #37: <unknown function> + 0x13cee4 (0x5580c8babee4 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #38: _PyObject_CallMethodIdObjArgs + 0x16f (0x5580c8bba7af in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #39: PyImport_ImportModuleLevelObject + 0x551 (0x5580c8bb9ab1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x3981 (0x5580c8b9fee1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #41: <unknown function> + 0x1d5852 (0x5580c8c44852 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #42: PyEval_EvalCode + 0x87 (0x5580c8c44797 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #43: <unknown function> + 0x1dcde0 (0x5580c8c4bde0 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #44: <unknown function> + 0x13d934 (0x5580c8bac934 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x5c81 (0x5580c8ba21e1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #46: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x4d1d (0x5580c8ba127d in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #48: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #50: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #52: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #54: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #55: <unknown function> + 0x13cee4 (0x5580c8babee4 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x16f (0x5580c8bba7af in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #57: PyImport_ImportModuleLevelObject + 0x551 (0x5580c8bb9ab1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3981 (0x5580c8b9fee1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #59: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #61: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #63: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)

The same error message was repeated eight times, which matches the number of workers I was using. I suspect that it has something to do with multiprocessing.
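
One possible reading of the multiprocessing suspicion, not confirmed anywhere in this thread: a CUDA context that already exists in the notebook process cannot be used in worker processes created with the `fork` start method, and PyTorch reports errors of this kind in that situation. The sketch below assumes a plain `torch.utils.data.DataLoader`. The helper `dataloader_mp_kwargs` is hypothetical and is not part of weaver; whether weaver's train_load can forward such an argument is not stated here.

```python
import multiprocessing as mp


def dataloader_mp_kwargs(num_workers, cuda_initialized):
    """Hypothetical helper: extra keyword arguments for torch.utils.data.DataLoader.

    Assumption: if CUDA was already initialized in the parent process (for
    example by an earlier run in the same notebook), worker processes are
    started with "spawn" so they do not inherit the parent's CUDA state.
    """
    if num_workers == 0:
        # Single-process loading: DataLoader rejects a multiprocessing context here.
        return {}
    if cuda_initialized and "spawn" in mp.get_all_start_methods():
        return {"multiprocessing_context": "spawn"}
    # Otherwise keep the platform default start method.
    return {}


# Usage sketch (requires torch, not executed here):
#   import torch
#   from torch.utils.data import DataLoader
#   extra = dataloader_mp_kwargs(8, torch.cuda.is_initialized())
#   loader = DataLoader(dataset, num_workers=8, **extra)
```

With `spawn`, each worker starts from a fresh interpreter, so startup is slower and the dataset object must be picklable. That trade-off is why the helper only switches the start method when CUDA is already initialized.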

hqucms commented 1 year ago

Hi @ryanliu30 -- can you provide a minimal example to reproduce the error?

ryanliu30 commented 1 year ago

Hi @hqucms, sorry for the late reply. I wasn't able to reproduce the error without running my whole training script. FYI, I trained my model with pytorch_lightning and wandb in Jupyter. However, I worked around the problem by removing the matplotlib dependency and switching to plotly. You can close the issue if you would like to.