I am relatively new, so I hope I am not just doing something very stupid :)
I am trying to adapt the quantized example for my use case. The inference code is pretty much the same as in the example. In general the code works: I am prompting two models on two separate GPUs in a loop. After N iterations (N differs every run, but is always below 100) I hit the error below.
I am running a quantized llama-3-8b-instruct loaded from a .gguf file.
I would appreciate any tips if the error is on my side. Here is the access to the code.
NOTE: I'm running two A6000 GPUs. This is the nvcc version:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
thread '<unnamed>' panicked at /home/vake/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.10.0/src/driver/safe/core.rs:208:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /home/vake/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.10.0/src/driver/safe/core.rs:208:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x58c00bd19556 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h410d4c66be4e37f9
1: 0x58c00bd43550 - core::fmt::write::he40921d4802ce2ac
2: 0x58c00bd16d4f - std::io::Write::write_fmt::h5de5a4e7037c9b20
3: 0x58c00bd19334 - std::sys_common::backtrace::print::h11c067a88e3bdb22
4: 0x58c00bd1abb7 - std::panicking::default_hook::{{closure}}::h8c832ecb03fde8ea
5: 0x58c00bd1a919 - std::panicking::default_hook::h1633e272b4150cf3
6: 0x58c00bd1b048 - std::panicking::rust_panic_with_hook::hb164d19c0c1e71d4
7: 0x58c00bd1af22 - std::panicking::begin_panic_handler::{{closure}}::h0369088c533c20e9
8: 0x58c00bd19a56 - std::sys_common::backtrace::__rust_end_short_backtrace::hc11d910daf35ac2e
9: 0x58c00bd1ac74 - rust_begin_unwind
10: 0x58c00b9113d5 - core::panicking::panic_fmt::ha6effc2775a0749c
11: 0x58c00b911923 - core::result::unwrap_failed::ha188096f98826595
12: 0x58c00ba2b6c4 - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h4c289e05ebd51ae6
13: 0x58c00ba2aafc - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<f32>>::hcbf6a15615cee068
14: 0x58c00ba2b1ca - alloc::sync::Arc<T,A>::drop_slow::h994a5bb01f1fc442
15: 0x58c00ba2af50 - alloc::sync::Arc<T,A>::drop_slow::h4a65dc7109aa30f1
16: 0x58c00ba1802a - candle_transformers::models::quantized_llama::ModelWeights::forward::had1312fe871968d8
17: 0x58c00b94121d - llm_bitcoin_inscription_analysis::llm::prompt::prompt_model::hbe917d2214140c60
18: 0x58c00b96e876 - core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut::h5f9d812f749ee289
19: 0x58c00b96b756 - rayon::iter::plumbing::Folder::consume_iter::h2c8efde69e0f7383
20: 0x58c00b971bfc - rayon::iter::plumbing::bridge_producer_consumer::helper::h814a881abff08b3e
21: 0x58c00b973006 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8fb2eedfc5ec12fd
22: 0x58c00b90ce9f - rayon_core::registry::WorkerThread::wait_until_cold::hc0ea83de9f250620
23: 0x58c00bceaa32 - rayon_core::registry::ThreadBuilder::run::hedc5a5eddbc123f1
24: 0x58c00bcedbca - std::sys_common::backtrace::__rust_begin_short_backtrace::h14baabb9af848a11
25: 0x58c00bceeaef - core::ops::function::FnOnce::call_once{{vtable.shim}}::h49599ea7439698c3
26: 0x58c00bd1fb95 - std::sys::pal::unix::thread::Thread::new::thread_start::h3631815ad38387d6
27: 0x7b8d4de94ac3 - start_thread
at ./nptl/pthread_create.c:442:8
28: 0x7b8d4df26850 - __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
29: 0x0 - <unknown>
stack backtrace:
thread '<unnamed>' panicked at library/core/src/panicking.rs:163:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
   0: 0x58c00bd19556 - <std::sys_common::backtrace
Aborted (core dumped)