microsoft / aici

AICI: Prompts as (Wasm) Programs

confusing error for missing CUDA compute cap 8.0 #87

vargonis commented 7 months ago

I tried to run the CUDA server from within a container, but a thread panics:

running /workspace/aici/target/release/rllm-cuda --verbose --aicirt /workspace/aici/target/release/aicirt -m microsoft/phi-2@d3186761bf5c4409f7679359284066c25ab668ee -t phi -w /workspace/aici/rllm/rllm-cuda/expected/phi-2/cats.safetensors  --host 0.0.0.0
INFO [rllm::server] explicit tokenizer: phi
INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
tokenizer.json [00:00:00] [████████████████████████████████████████████████████████████████] 2.02 MiB/2.02 MiB 3.86 MiB/s (0s)
INFO [rllm::engine] TokTrie building: TokRxInfo { vocab_size: 50295, tok_eos: 50256 } wl=50295
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
INFO [rllm_cuda::llm::loader] loading the model from https://huggingface.co/microsoft/phi-2/resolve/d3186761bf5c4409f7679359284066c25ab668ee/
config.json [00:00:00] [█████████████████████████████████████████████████████████████████████████] 755 B/755 B 6.69 KiB/s (0s)
INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
Listening at http://0.0.0.0:4242
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
INFO [rllm_cuda::llm::loader] loading the model from https://huggingface.co/microsoft/phi-2/resolve/d3186761bf5c4409f7679359284066c25ab668ee/
INFO [actix_server::builder] starting 3 workers
INFO [actix_server::server] Actix runtime found; starting in Actix runtime
INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5
INFO [hf_hub] Token file not found "/root/.cache/huggingface/token"
model.safetensors.index.json [00:00:00] [██████████████████████████████████████████████] 23.72 KiB/23.72 KiB 208.11 KiB/s (0s)
..del-00001-of-00002.safetensors [00:00:16] [████████████████████████████████████████████] 4.64 GiB/4.64 GiB 280.05 MiB/s (0s)
..del-00002-of-00002.safetensors [00:00:02] [████████████████████████████████████████] 550.21 MiB/550.21 MiB 252.67 MiB/s (0s)
INFO [rllm_cuda::llm::loader] building the model
INFO [rllm_cuda::llm::util] cuda mem: initial current: 0.000GiB, peak: 0.000GiB, allocated: 0.000GiB, freed: 0.000GiB
[00:00:01] ████████████████████████████████████████████████████████████  325/325  [00:00:00]
INFO [rllm_cuda::llm::loader] model loaded
INFO [rllm_cuda::llm::util] cuda mem: model fully loaded current: 5.196GiB, peak: 5.929GiB, allocated: 15.569GiB, freed: 10.373GiB
INFO [rllm_cuda::llm::paged::batch_info] profile: BatchInfo { step_no: 0, tokens: Tensor[[2048], Int], positions: Tensor[[2048], Int64], seqlens_q: [0, 1948], seqlens_k: [0, 1948], gather_mapping: 1948, slot_mapping: 2048, max_seqlen_q: 1948, max_seqlen_k: 1948, paged_block_tables: Tensor[[100, 13], Int], paged_context_lens: Tensor[[100], Int], paged_block_size: 16, paged_max_context_len: 204, seqlen_multi: 1, q_multi: 1948 }
INFO [rllm_cuda::llm::util] cuda mem: before model profile current: 5.196GiB, peak: 5.196GiB, allocated: 15.569GiB, freed: 10.373GiB
killing 3806
thread '<unnamed>' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.14.0/src/wrappers/tensor_generated.rs:17495:36:
called `Result::unwrap()` on an `Err` value: Torch("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75dcee992617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x75dcee94d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x75dcf06859f8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> const&) + 0x786 (0x75dcb36bee26 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: void at::native::gpu_kernel<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> const&) + 0x11b (0x75dcb36bf79b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #5: at::native::direct_copy_kernel_cuda(at::TensorIteratorBase&) + 0x338 (0x75dcb36aa0c8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #6: at::native::copy_device_to_device(at::TensorIterator&, bool, bool) + 0xccd (0x75dcb36aae4d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #7: <unknown function> + 0x1590e92 (0x75dcb36ace92 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #8: <unknown function> + 0x1ac2ebf (0x75dc9cccaebf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x75dc9cccc1f2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #10: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x75dc9d9a54af in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #11: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1bd5 (0x75dc9cf9a7d5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #12: <unknown function> + 0x2b2f12b (0x75dc9dd3712b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #13: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x75dc9d4a1425 in 
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #14: <unknown function> + 0x295e793 (0x75dc9db66793 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x75dc9d4a1425 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #16: <unknown function> + 0x4020ecf (0x75dc9f228ecf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #17: <unknown function> + 0x402147e (0x75dc9f22947e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #18: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1ee (0x75dc9d52894e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #19: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x75dc9cf9212b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #20: <unknown function> + 0x2d074d1 (0x75dc9df0f4d1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #21: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x203 (0x75dc9d6bcc13 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #22: <unknown function> + 0x21fb75 (0x5c405b293b75 in /workspace/aici/target/release/rllm-cuda)\nframe #23: <unknown function> + 0x223a56 (0x5c405b297a56 in /workspace/aici/target/release/rllm-cuda)\nframe #24: <unknown function> + 0x206916 (0x5c405b27a916 in /workspace/aici/target/release/rllm-cuda)\nframe #25: <unknown function> + 0x202a72 (0x5c405b276a72 in /workspace/aici/target/release/rllm-cuda)\nframe #26: <unknown function> + 0x18da7a (0x5c405b201a7a in /workspace/aici/target/release/rllm-cuda)\nframe #27: <unknown function> + 0x1a419a (0x5c405b21819a in /workspace/aici/target/release/rllm-cuda)\nframe #28: <unknown function> + 0x1d163b (0x5c405b24563b in /workspace/aici/target/release/rllm-cuda)\nframe #29: <unknown function> + 0x135f4d (0x5c405b1a9f4d in /workspace/aici/target/release/rllm-cuda)\nframe #30: <unknown function> + 0x154fc2 (0x5c405b1c8fc2 in /workspace/aici/target/release/rllm-cuda)\nframe #31: <unknown function> + 0x81e305 (0x5c405b892305 in /workspace/aici/target/release/rllm-cuda)\nframe #32: <unknown function> + 0x94ac3 (0x75dc9ac31ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)\nframe #33: clone + 0x44 (0x75dc9acc2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)\n")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1653:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1077:23
   4: tch::wrappers::tensor_generated::<impl tch::wrappers::tensor::Tensor>::totype
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.14.0/src/wrappers/tensor_generated.rs:17495:9
   5: tch_cuda::reshape_and_cache
             at /workspace/aici/rllm/tch-cuda/src/lib.rs:200:24
   6: rllm_cuda::llm::save_attn
             at ./src/llm/mod.rs:158:9
   7: rllm_cuda::llm::varlen_attn
             at ./src/llm/mod.rs:332:5
   8: rllm_cuda::llm::phi::MHA::forward
             at ./src/llm/phi.rs:147:17
   9: rllm_cuda::llm::phi::ParallelBlock::forward
             at ./src/llm/phi.rs:173:28
  10: <rllm_cuda::llm::phi::MixFormerSequentialForCausalLM as rllm_cuda::llm::tmodel::TModelInner>::forward
             at ./src/llm/phi.rs:215:18
  11: rllm_cuda::llm::loader::profile_model
             at ./src/llm/loader.rs:196:23
  12: rllm_cuda::llm::loader::load_rllm_engine
             at ./src/llm/loader.rs:173:22
  13: <rllm_cuda::llm::tmodel::TModel as rllm::exec::ModelExec>::load_rllm_engine
             at ./src/llm/tmodel.rs:61:9
  14: rllm::server::spawn_inference_loop::{{closure}}
             at /workspace/aici/rllm/rllm-base/src/server/mod.rs:473:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is running within a GCP VM with the following configuration: [screenshot of the VM configuration]

Steps to reproduce:

mmoskal commented 7 months ago

You'll need a GPU with compute capability 8.0 or later. I have honestly only tried an A100.
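
You can check what your VM actually has with something like this (just a sketch, not part of the repo; the compute_cap query field assumes a reasonably recent NVIDIA driver):

```rust
// Sketch: query the GPU's compute capability via nvidia-smi and compare
// against the 8.0 minimum that rllm-cuda's kernels are built for.
use std::process::Command;

fn compute_capability() -> Option<f32> {
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=compute_cap", "--format=csv,noheader"])
        .output()
        .ok()?;
    // nvidia-smi prints one line per GPU, e.g. "7.5"; take the first device.
    String::from_utf8(out.stdout).ok()?.lines().next()?.trim().parse().ok()
}

fn main() {
    match compute_capability() {
        Some(cap) if cap >= 8.0 => println!("compute cap {cap}: OK for rllm-cuda"),
        Some(cap) => eprintln!("compute cap {cap}: rllm-cuda needs 8.0+ (e.g. A100)"),
        None => eprintln!("could not query compute capability via nvidia-smi"),
    }
}
```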

You can try llama.cpp on CUDA (./server.sh --cuda ... in rllm-llamacpp).

We definitely need a better error message.
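
Something along these lines in the loader could fail fast with a readable message instead of the tch panic above (sketch only, not current rllm-cuda code; it reuses compute_capability() from the snippet above and assumes an anyhow-style error type):

```rust
// Hypothetical fail-fast check for the loader: bail with a clear message
// before building the model, instead of letting tch panic deep inside the
// first forward pass.
fn check_compute_cap() -> anyhow::Result<()> {
    // compute_capability() as in the nvidia-smi sketch above.
    let cap = compute_capability()
        .ok_or_else(|| anyhow::anyhow!("could not determine CUDA compute capability"))?;
    anyhow::ensure!(
        cap >= 8.0,
        "GPU compute capability {cap} is not supported; rllm-cuda requires 8.0 or later (e.g. A100)"
    );
    Ok(())
}
```

Called at the top of the loader, this would turn the backtrace above into a one-line error.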

vargonis commented 7 months ago

Didn't try with an A100, but the --cuda option for the llamacpp server works, thanks!