jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
Apache License 2.0
529 stars 33 forks

error when finetuning yi-34b #13

Open puppet101 opened 2 months ago

puppet101 commented 2 months ago

Hi, thank you for this great project. I am fine-tuning Yi-34B, and loading the model fails with a CUDA OOM error. To avoid the OOM during loading, I set zero3_init_flag to true. Training then fails with other errors, which I have pasted below. Could you please help me? Thank you!

/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [257,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
[rank6]:[E410 09:24:27.054138428 ProcessGroupNCCL.cpp:1430] [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fd1a42a5375 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x3f2 (0x7fd1a43b0612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fd182ac63de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fd182aca678 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x8ad (0x7fd182ad2fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7fd182ad3c08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1434 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fd1a42fb67e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xfded22 (0x7fd182afbd22 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd342da (0x7fd1828512da in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fd1a3eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fd1a4e6bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fd1a4efca04 in /lib/x86_64-linux-gnu/libc.so.6)

/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [74,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [75,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [76,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [77,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [78,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [81,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [68,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
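The `srcIndex < srcSelectDimSize` assertion fires when an index-select kernel receives an index that is not smaller than the size of the dimension being indexed. One plausible cause, which is an assumption and not something confirmed in this thread, is a token id that is greater than or equal to the number of rows in the embedding table. The sketch below is a minimal, framework-free way to test for that on a batch before it reaches the GPU. The function name `find_out_of_range_ids` and the example values are hypothetical. The sketch also sets `CUDA_LAUNCH_BLOCKING=1`, as the log itself recommends.

```python
import os

# Suggested by the log above. Set it before CUDA is initialized so the
# failing kernel is reported at its real call site.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range_ids(input_ids, num_embeddings):
    """Return (row, col, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, sequence in enumerate(input_ids):
        for col, token_id in enumerate(sequence):
            if not 0 <= token_id < num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical values for illustration only; not taken from the issue.
example_batch = [[1, 5, 63999], [2, 64000, 7]]
example_vocab_rows = 64000
print(find_out_of_range_ids(example_batch, example_vocab_rows))
# → [(1, 1, 64000)]
```

In a real run, `input_ids` would be the tokenized batch converted to nested lists and `num_embeddings` the first dimension of the model's input embedding weight. If every id is in range, the embedding weight's shape at the time of the lookup would be the next thing to check.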