Hi, thank you for this great project. I am fine-tuning Yi-34B, and a CUDA OOM error occurs when loading the model. To avoid the OOM at load time I set `zero3_init_flag` to true. But during training I hit other errors, which I paste below. Could you please help me? Thank you!
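For context, this is roughly the change I made — a sketch of the relevant DeepSpeed section of an Accelerate config (other fields omitted; exact values are from my setup and may differ in yours):

```yaml
# Excerpt of accelerate config (default_config.yaml); only the changed flag matters here.
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true   # initialize weights directly under ZeRO-3 partitioning to avoid OOM at load
```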
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [257,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank6]:[E410 09:24:27.054138428 ProcessGroupNCCL.cpp:1430] [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1434 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xfded22 (0x7fd182afbd22 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd342da (0x7fd1828512da in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion repeats for threads [65,0,0] through [84,0,0] of block [68,0,0])
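In case it helps narrow things down: as far as I understand, the `srcIndex < srcSelectDimSize` assertion in `indexSelectLargeIndex` typically fires when an embedding lookup receives a token id at or beyond the embedding table size (e.g. the tokenizer vocabulary is larger than the model's `vocab_size`). A minimal sanity check I ran on my data (the function name and sample values below are mine, not from any library):

```python
def check_token_ids(input_ids, vocab_size):
    """Return the sorted list of token ids that fall outside [0, vocab_size)."""
    return sorted({t for t in input_ids if not (0 <= t < vocab_size)})

# Example: with vocab_size=64000, the id 64001 would trigger the
# indexSelectLargeIndex assertion during the embedding lookup.
print(check_token_ids([1, 5, 64001, 3], 64000))  # → [64001]
print(check_token_ids([0, 1, 2], 64000))         # → []
```

If this returns a non-empty list for any batch, the fix on my side would be resizing the model's embeddings to the tokenizer size (or fixing the tokenizer), rather than anything DeepSpeed-related.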