huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Cannot run examples with --features cuda option #353

Open dbrowne opened 1 year ago

dbrowne commented 1 year ago

CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true warning: some crates are on edition 2021 which defaults to resolver = "2", but virtual workspaces default to resolver = "1" note: to keep the current resolver, specify workspace.resolver = "1" in the workspace root's manifest note: to use the edition 2021 resolver, specify workspace.resolver = "2" in the workspace root's manifest Compiling libc v0.2.147 Compiling autocfg v1.1.0 Compiling crossbeam-utils v0.8.16 Compiling proc-macro2 v1.0.66 Compiling unicode-ident v1.0.11 Compiling rayon-core v1.11.0 Compiling memchr v2.5.0 Compiling libm v0.2.7 Compiling cfg-if v1.0.0 Compiling pkg-config v0.3.27 Compiling paste v1.0.14 Compiling serde v1.0.183 Compiling serde_derive v1.0.183 Compiling scopeguard v1.2.0 Compiling syn v1.0.109 Compiling serde_json v1.0.104 Compiling seq-macro v0.3.5 Compiling vcpkg v0.2.15 Compiling crc32fast v1.3.2 Compiling ident_case v1.0.1 Compiling strsim v0.10.0 Compiling fnv v1.0.7 Compiling thiserror v1.0.44 Compiling either v1.9.0 Compiling glob v0.3.1 Compiling openssl v0.10.56 Compiling rustls v0.21.6 Compiling anyhow v1.0.72 Compiling cudarc v0.9.13 Compiling portable-atomic v1.4.2 Compiling native-tls v0.2.11 Compiling esaxx-rs v0.1.8 Compiling adler v1.0.2 Compiling rustix v0.38.7 Compiling gimli v0.27.3 Compiling macro_rules_attribute-proc_macro v0.1.3 Compiling rustc-demangle v0.1.23 Compiling miniz_oxide v0.7.1 Compiling heck v0.4.1 Compiling flate2 v1.0.26 Compiling memoffset v0.9.0 Compiling crossbeam-epoch v0.9.15 Compiling num-traits v0.2.16 Compiling zip v0.6.6 Compiling crossbeam-channel v0.5.8 Compiling aho-corasick v1.0.2 Compiling object v0.31.1 Compiling nom v7.1.3 Compiling aho-corasick v0.7.20 Compiling quote v1.0.32 Compiling macro_rules_attribute v0.1.3 Compiling syn v2.0.28 Compiling crossbeam-deque v0.8.3 Compiling num_cpus v1.16.0 Compiling getrandom v0.2.10 Compiling dirs-sys v0.4.1 Compiling console v0.15.7 Compiling memmap2 v0.7.1 Compiling regex-automata v0.3.6 Compiling cc v1.0.82 Compiling dirs v5.0.1 Compiling rand_core v0.6.4 Compiling num-complex v0.4.3 Compiling rand_chacha v0.3.1 Compiling indicatif v0.17.6 Compiling rand v0.8.5 Compiling addr2line v0.20.0 Compiling rayon v1.7.0 Compiling is-terminal v0.4.9 Compiling ring v0.16.20 Compiling openssl-sys v0.9.91 Compiling rand_distr v0.4.3 Compiling backtrace v0.3.68 Compiling onig_sys v69.8.1 Compiling anstream v0.3.2 Compiling clap_builder v4.3.21 Compiling half v2.3.1 Compiling spm_precompiled v0.1.4 Compiling regex v1.9.3 Compiling darling_core v0.14.4 Compiling fancy-regex v0.10.0 Compiling candle-kernels v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-kernels) Compiling candle-gemm-common v0.15.5 Compiling rayon-cond v0.1.0 Compiling candle-gemm-f32 v0.15.5 Compiling candle-gemm-f64 v0.15.5 Compiling candle-gemm-c64 v0.15.5 Compiling candle-gemm-c32 v0.15.5 Compiling safetensors v0.3.2 Compiling candle-examples v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-examples) Compiling tracing-chrome v0.7.1 Compiling candle-gemm-f16 v0.15.5 error: failed to run custom build command for candle-kernels v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-kernels)

Caused by: process didn't exit successfully: /mnt/source1/djbGR/ruststuffs/candle/target/release/build/candle-kernels-e21ab5b8e8daaf0a/build-script-build (exit status: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
cargo:rerun-if-changed=src/
cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
cargo:rustc-env=CUDA_COMPUTE_CAP=sm_61

--- stderr src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

2 errors detected in the compilation of "src/indexing.cu". src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

2 errors detected in the compilation of "src/affine.cu". src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

2 errors detected in the compilation of "src/cast.cu". 2 errors detected in the compilation of "src/reduce.cu". 2 errors detected in the compilation of "src/conv.cu". src/compatibility.cuh(19): error: function "hmax_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmax_nan(half a, half b) { ^

src/compatibility.cuh(22): error: function "hmin_nan(half, half)" has already been defined attribute((device)) inline attribute((always_inline)) half __hmin_nan(half a, half b) { ^

2 errors detected in the compilation of "src/ternary.cu". 2 errors detected in the compilation of "src/unary.cu". 2 errors detected in the compilation of "src/binary.cu". thread 'main' panicked at 'nvcc error while compiling "src/affine.cu":

stdout

stderr

', candle-kernels/build.rs:207:13 stack backtrace: 0: 0x557f8498d0b1 - std::backtrace_rs::backtrace::libunwind::trace::hb01a67340c9cfb71 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5 1: 0x557f8498d0b1 - std::backtrace_rs::backtrace::trace_unsynchronized::h896aca561948c930 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 2: 0x557f8498d0b1 - std::sys_common::backtrace::_print_fmt::h8627be5b68fbde29 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:65:5 3: 0x557f8498d0b1 - ::fmt::h1b7758da45f4cd22 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:44:22 4: 0x557f849b282c - core::fmt::rt::Argument::fmt::h0eb38586043a01ca at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/fmt/rt.rs:138:9 5: 0x557f849b282c - core::fmt::write::h68b52f8aa598961e at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/fmt/mod.rs:1094:21 6: 0x557f8498949e - std::io::Write::write_fmt::hc5568929b662da92 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/io/mod.rs:1714:15 7: 0x557f8498cec5 - std::sys_common::backtrace::_print::h65aecbff12ca83c8 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:47:5 8: 0x557f8498cec5 - std::sys_common::backtrace::print::hf75ac9d60598d247 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:34:9 9: 0x557f8498e483 - std::panicking::default_hook::{{closure}}::hc2cb8da3be7476b0 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:269:22 10: 0x557f8498e19d - std::panicking::default_hook::hefa49c86da66275b at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:288:9 11: 0x557f8498ea09 - std::panicking::rust_panic_with_hook::hd4c3b0056ba96951 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:705:13 12: 0x557f8498e907 - std::panicking::begin_panic_handler::{{closure}}::he487675683e9a525 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:597:13 13: 0x557f8498d516 - std::sys_common::backtrace::rust_end_short_backtrace::hcff58b9b81620321 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:151:18 14: 0x557f8498e652 - rust_begin_unwind at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:593:5 15: 0x557f848b9333 - core::panicking::panic_fmt::h1b81548733a03bd5 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/panicking.rs:67:14 16: 0x557f848c3323 - build_script_build::cuda::build_ptx::ha488acce3cd701b3 at /mnt/source1/djbGR/ruststuffs/candle/candle-kernels/build.rs:207:13 17: 0x557f848c0878 - build_script_build::main::h2523e6c20b65fa04 at /mnt/source1/djbGR/ruststuffs/candle/candle-kernels/build.rs:6:33 18: 0x557f848d40cb - core::ops::function::FnOnce::call_once::h385ddf31127d3e12 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/ops/function.rs:250:5 19: 0x557f848ccbae - std::sys_common::backtrace::rust_begin_short_backtrace::h1cfd550c72c3e194 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:135:18 20: 0x557f848e0130 - std::rt::lang_start::{{closure}}::h70dc5fa7783a03f7 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:166:18 21: 0x557f8498541b - core::ops::function::impls::<impl 
core::ops::function::FnOnce for &F>::call_once::h9eccf02cf11756f6 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/ops/function.rs:284:13 22: 0x557f8498541b - std::panicking::try::do_call::hc95b838862bbb45a at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:500:40 23: 0x557f8498541b - std::panicking::try::h82935254d12a76fc at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:464:19 24: 0x557f8498541b - std::panic::catch_unwind::h7fd9d11cd70fc350 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panic.rs:142:14 25: 0x557f8498541b - std::rt::lang_start_internal::{{closure}}::h0ddb191e68b650a4 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:148:48 26: 0x557f8498541b - std::panicking::try::do_call::h17d4693c7a6e120c at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:500:40 27: 0x557f8498541b - std::panicking::try::h684fc020e1305912 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:464:19 28: 0x557f8498541b - std::panic::catch_unwind::h757da538db515116 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panic.rs:142:14 29: 0x557f8498541b - std::rt::lang_start_internal::ha6b1625a1e9a4f5b at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:148:20 30: 0x557f848e010a - std::rt::lang_start::h0d1360f20fc735dd at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:165:17 31: 0x557f848c43fe - main 32: 0x7fd8be429d90 - libc_start_call_main at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16 33: 0x7fd8be429e40 - libc_start_main_impl at ./csu/../csu/libc-start.c:392:3 34: 0x557f848b9a15 - _start 35: 0x0 -

Dominically commented 1 year ago

What OS and CUDA version are you using? I seem to be having a similar issue (with loads of C/CU/C++ errors) with both the crates.io and the GitHub versions on Windows.

This is my Cargo.toml:

[package]
name = "candle_test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
# candle-core = {git = "https://github.com/huggingface/candle.git", branch = "main", features = ["cuda"]}
candle-core = {features = ["cuda"], version = "0.1.0"}
# candle-nn = {git = "https://github.com/huggingface/candle.git", branch = "main"}

Rust version is 1.71.0. I'm running CUDA 11.7, which could be the problem, but I can't see anything about what version I should use.

dbrowne commented 1 year ago
Linux version 6.2.0-26-generic (buildd@bos03-amd64-042) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2

rustc 1.73.0-nightly (39f42ad9e 2023-07-19)

Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
n8henrie commented 1 year ago

Also seeing thread 'main' panicked at 'nvcc error while compiling "src/affine.cu"' on Arch.

$ rustc --version
rustc 1.71.0 (8ede3aae2 2023-07-12)
$ pacman -Qi cuda
Name            : cuda
Version         : 12.2.0-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : custom:NVIDIA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64
                  libcusolver.so=11-64  libcusolver.so=11-64  libcusparse.so=12-64  libcusparse.so=12-64
Depends On      : opencl-nvidia  nvidia-utils  python  gcc12
Optional Deps   : gdb: for cuda-gdb [installed]
                  glu: required for some profiling tools in CUPTI [installed]
Required By     : cudnn  magma-cuda  python-pycuda  python-pytorch-cuda  python-tensorflow-opt-cuda
                  tensorflow-opt-cuda
Optional For    : meshroom-bin  openmpi
Conflicts With  : None
Replaces        : cuda-toolkit  cuda-sdk  cuda-static
Installed Size  : 4.36 GiB
Packager        : Sven-Hendrik Haase <svenstaro@archlinux.org>
Build Date      : Sun 02 Jul 2023 01:59:36 PM MDT
Install Date    : Sun 16 Jul 2023 07:35:25 AM MDT
Install Reason  : Installed as a dependency for another package
Install Script  : Yes
Validated By    : Signature
Narsil commented 1 year ago

What cards do you guys have?

We need compute_cap>7.0 for it to work. I know compute_cap 5.2 triggers similar failures.

The core kernels we have use f16 and bf16, and those old cards cannot compile them properly. I added some flags for some options but I haven't checked all potential caps yet.
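
For anyone trying to check this up front, the detection boils down to something like the following build-script sketch (this is not candle's actual build.rs; the nvidia-smi query, the sm_XX conversion and the 7.x cutoff are assumptions taken from this thread):

use std::process::Command;

// Sketch only: honour a CUDA_COMPUTE_CAP override first, otherwise ask nvidia-smi.
fn compute_cap() -> Option<u32> {
    if let Ok(cap) = std::env::var("CUDA_COMPUTE_CAP") {
        return cap.trim().parse().ok();
    }
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=compute_cap", "--format=csv,noheader"])
        .output()
        .ok()?;
    // nvidia-smi prints e.g. "6.1"; turn that into 61, matching the sm_61 form nvcc expects.
    let text = String::from_utf8(out.stdout).ok()?;
    let (major, minor) = text.trim().split_once('.')?;
    Some(major.parse::<u32>().ok()? * 10 + minor.parse::<u32>().ok()?)
}

fn main() {
    println!("cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP");
    match compute_cap() {
        Some(cap) if cap >= 70 => println!("cargo:rustc-env=CUDA_COMPUTE_CAP={cap}"),
        Some(cap) => panic!("compute cap {cap} is below the 7.x the f16/bf16 kernels currently assume"),
        None => panic!("could not detect a CUDA compute capability"),
    }
}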

krolinventions commented 1 year ago

I could get my own project to compile by specifying the compute capability on the command line. To see what your system supports, use nvcc --list-gpu-code:

CUDA_COMPUTE_CAP=90 cargo build --release --features cuda

However, during runtime I get DriverError(CUDA_ERROR_NOT_SUPPORTED, "operation not supported"). So GPU acceleration does not work for me. It looks like that's something else though, as this also doesn't work for me: https://askubuntu.com/a/1215237. That's not related to candle at all, so no need to fix that in this thread.

krolinventions commented 1 year ago

OK, did some more investigation. It turns out that my device only supports up to 50. So I can get pure CUDA code to run if I compile with nvcc -arch=sm_50. I can also get my application that uses candle to compile with that, but it still gives me the driver error, so I guess it's not supported.

My GPU (Quadro M620) runs torch fine, so would be great if candle could add support for it! Mainly because it's a nice laptop to develop on.

Narsil commented 1 year ago

@krolinventions Perfectly understandable. My own GTX 970 is too old to run candle atm.

However, in order to deliver fast we had to cut corners in that department. For now I would like to focus on giving a good error message before actually writing kernels that work on old hardware. I may do it in my spare time to be able to use my old GPU, but I know how much time it takes.

If you want to take a stab at it, you're more than welcome!

krolinventions commented 1 year ago

@Narsil Actually, I think just using the CPU for development is fine. It's great not to have to deal with installing all that extra stuff, like with torch, or the CUDA libraries.

As for the kernels: I have never used CUDA before, but from the few examples that I've seen it looks rather nice. I think I may need a little more experience with it before tackling these, but maybe!

Narsil commented 1 year ago

Do try, it's not as daunting as it looks (it's only daunting when you want the best possible performance). Feel free to join the HF Discord's candle channel to continue the discussion.

I'll keep the issue open to track giving a better error message.

n8henrie commented 1 year ago

We need compute_cap>7.0 for it to work.

I'm using a 1080 Ti:

$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
6.1

😢

Dominically commented 1 year ago

What cards do you guys have?

Just tried compiling with an RTX 3060 (compute cap 8.6) and CUDA 12.2 on Windows and I'm still getting a massive log of C errors.

e.g.:

error: asm operand type size(8) does not match type/size implied by constraint 'r'
    static __declspec(__device__) __inline longlong2 __ldg(const longlong2 *ptr) { longlong2 ret; asm volatile ("ld.global.nc.v2.s64 {%0,%1}, [%2];"  : "=l"(ret.x), "=l"(ret.y) : "r" (ptr)); return ret; }

along with other errors that are repeated many times.

Narsil commented 1 year ago

Yes, Windows seems to be having issues. I've been told on Discord that WSL is OK.

dbrowne commented 1 year ago

What cards do you guys have?

We need compute_cap>7.0 for it to work. I know compute_cap 5.2 triggers similar failures.

The core kernels we have use f16 and bf16, and those old cards cannot compile them properly. I added some flags for some options but I haven't checked all potential caps yet.

My compute cap is 6.1. It is a stretch, but I'm willing to work on this if you point me in the right direction.

Narsil commented 1 year ago

@dbrowne

Go to candle/candle-kernels/src/

And try to make the .cu files compile:

nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

Most of the logic should be in compatibility.cuh. 61 should be easier than 52 and earlier.

It seems I have a fix for 61.

Narsil commented 1 year ago

Can you take my PR out for a spin?

https://github.com/huggingface/candle/pull/386

It fixes compilation but it still doesn't work on my 52 because the ops are still not there. However, once you have the PTX you can test it outside of candle and debug by more classical means to try and understand why it compiles but fails to run.

n8henrie commented 1 year ago

Looks like that PR has been merged!

On current master, my 1080 Ti now works like a charm, thank you!

On my Threadripper it takes 13-15s for the example:

$ time cargo run --example whisper --release 
    Finished release [optimized] target(s) in 0.28s
     Running `target/release/examples/whisper`
Running on CPU, to run on GPU, build this example with `--features cuda`
No audio file submitted: Downloading https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 176000
loaded mel: [1, 80, 3000]
audio features: [1, 1500, 384]
3000: Segment { start: 0.0, duration: 30.0, dr: DecodingResult { tokens: [50257, 50363, 843, 523, 616, 5891, 3399, 1265, 407, 644, 534, 1499, 460, 466, 329, 345, 1265, 644, 345, 460, 466, 329, 534, 1499, 13, 50903, 50256], text: " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.", avg_logprob: -0.3303277552190798, no_speech_prob: 0.017772182822227478, temperature: 0.0, compression_ratio: NaN } }, in 15.577960389s

real    0m17.062s
user    0m12.536s
sys 0m3.221s

Enabling the cuda feature brings the decoding well below a second (~2 seconds total runtime). Wow!

$ time cargo run --example whisper --release --features cuda
    Finished release [optimized] target(s) in 0.31s
     Running `target/release/examples/whisper`
No audio file submitted: Downloading https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 176000
loaded mel: [1, 80, 3000]
audio features: [1, 1500, 384]
3000: Segment { start: 0.0, duration: 30.0, dr: DecodingResult { tokens: [50257, 50363, 843, 523, 616, 5891, 3399, 1265, 407, 644, 534, 1499, 460, 466, 329, 345, 1265, 644, 345, 460, 466, 329, 534, 1499, 13, 50903, 50256], text: " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.", avg_logprob: -0.3305633301574319, no_speech_prob: 0.017772099003195763, temperature: 0.0, compression_ratio: NaN } }, in 300.960168ms

real    0m2.166s
user    0m1.582s
sys 0m0.433s

CUDA-backed NNs in Rust!? This is really exciting :)

Narsil commented 1 year ago

@n8henrie This is far from optimized yet ;).

We ran a few passes, but there's still a lot more that can be done.

dbrowne commented 1 year ago

@dbrowne

Go to candle/candle-kernels/src/

And try to make the .cu files compile:

nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

Most of the logic should be in compatibility.cuh. 61 should be easier than 52 and earlier.

It seems I have a fix for 61.

I'm awaiting delivery of an RTX A4500. If it does not work in my workstation, I will begin pursuing this in earnest.

Narsil commented 1 year ago

Does it work now on main? I made fixes for older cards (still far from universal support, but it should be much better).

dbrowne commented 1 year ago

Does it work now on main? I made fixes for older cards (still far from universal support, but it should be much better).

Yes

krolinventions commented 1 year ago

@Narsil Also works for me (Quadro M620). Both the examples and my own code. Thanks!

ViliamVadocz commented 1 year ago

If people in the future have similar CUDA compilation errors (functions already being defined, etc.), it's because the compatibility header compatibility.cuh is not perfect. Not all graphics cards and all driver versions have the same functions, which is why compatibility.cuh attempts to emulate them. Unfortunately, which GPUs and which drivers introduce each function is not well documented by NVIDIA, so several guesses were made regarding when to emulate things.

If you get such an error, please report it here (or as an issue on candle or dfdx). You can use these commands to give helpful context:

> nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA T500, 7.5, 536.25

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

For more information:

bayedieng commented 1 year ago

Getting a similar error using a 2080 Ti with CUDA 12.2, driver 535.86.05, on Pop!_OS 22.04.

compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

This might be a result of the CUDA toolkit being an older version. Pop!_OS does not have a newer driver compatible with the latest CUDA toolkit yet. Hopefully driver differences won't matter much in the future.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
theHausdorffMetric commented 1 year ago

Can't compile with the cuda feature. Does the following info help (is there anything a clueless CUDA newbie could do)?

$ nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.
compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

6 errors detected in the compilation of "affine.cu".
$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce GTX 1080, 6.1, 535.86.05
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ uname -ar
Linux visi2 5.19.0-41-generic #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Narsil commented 1 year ago

@bayedieng @theHausdorffMetric

compatibility.cuh(11): error: identifier "__hmax" is undefined

Yes, this means CUDA 11.5 doesn't have this function, so the compat layer doesn't work.

Upgrading CUDA should help; use at least 11.8.

GeauxEric commented 1 year ago

OK, not sure if this is an edge case. I was trying out candle on an NVIDIA Jetson Nano. candle failed to detect CUDA because it uses nvidia-smi, while Jetson uses tegrastats.

tezlm commented 1 year ago

Still doesn't work on main (4abc1ea34dbc834e561f442737faf2c735f0a6ce); here are yet more error messages:

$ nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

6 errors detected in the compilation of "affine.cu".
$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2060, 7.5, 535.104.05

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ uname -a
Linux chorusfruit 6.2.6-76060206-generic #202303130630~1689015125~22.04~ab2190e SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux
Narsil commented 1 year ago

Try with CUDA >= 12.

dashdeckers commented 1 year ago

I also have an issue getting candle to utilize the GPU.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2070, 7.5, 535.104.12

$ uname -a
Linux xxx 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I get this error when I run the Mistral example:

$ cargo run --example mistral --features cuda --release -- --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400
avx: true, neon: false, simd128: false, f16c: true
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
retrieved the files in 128.695µs
loaded the model in 2.726856172s
Here is a sample quick sort implementation in rust Illegal instruction (core dumped)

When I edit ~/candle/.cargo/config.toml to build with the flags rustflags = ["-C", "target-cpu=native", "-C", "target-feature=-avx,-avx2"] (basically I copy in the build flags from [target.x86_64-apple-darwin], inspired by your suggestion in issue #622), it runs fine on the CPU.

LaurentMazare commented 1 year ago

I doubt that it's CUDA related, as the quantized models are supposed to always be on the CPU at the moment. Could you try to run with the exact same setup as when it crashes, but remove --features cuda? Then keep the features flag and add --cpu. Finally, if you could launch this in gdb and send back the backtrace, that could be very useful (and in this case it's better to compile with debug symbols, i.e. replace --release with --profile=release-with-debug). Thanks.
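
For reference, the --cpu handling in the examples amounts to something like this sketch (Device::Cpu and Device::new_cuda are real candle_core items; the helper itself and the silent CPU fallback are just an illustration, not the examples' exact code):

use candle_core::{Device, Result};

// Hypothetical helper: force the CPU when --cpu is passed, otherwise try CUDA
// device 0 and fall back to the CPU if the driver fails to initialise.
fn pick_device(force_cpu: bool) -> Result<Device> {
    if force_cpu {
        return Ok(Device::Cpu);
    }
    Ok(Device::new_cuda(0).unwrap_or(Device::Cpu))
}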

dashdeckers commented 1 year ago

Aaah okay it's not supposed to run on the GPU anyway!

I'm having some issues with my machine and just gonna take the rest of the day off, but I will test your suggestions in the next few days and report back. I have no experience with debuggers, but I'm sure it's straightforward :)

dashdeckers commented 1 year ago

Here are the backtraces as chewed out by rust-gdb:

When built with the "vanilla" command from the example:

cargo run --example mistral --features cuda --profile=release-with-debug -- --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400

gdb-vanilla-command.txt

When built with the same command but without the --features cuda flag:

cargo run --example mistral --profile=release-with-debug -- --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400

gdb-without-cuda.txt

When built with the same ("vanilla") command but adding the --cpu flag:

cargo run --example mistral --features cuda --profile=release-with-debug -- --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400 --cpu

gdb-with-cpu.txt

If I forgot / messed up anything or you'd like to see another one, do let me know and I'll fire it up again!

dashdeckers commented 1 year ago

Back to the CUDA-related issue, when I run the "vanilla" command without the --quantized flag, I get the following error:

$ RUST_BACKTRACE=1 cargo run --example mistral --features cuda --profile=release-with-debug -- --prompt "Here is a sample quick sort implementation in rust " -n 400
    Finished release-with-debug [optimized + debuginfo] target(s) in 0.20s
     Running `target/release-with-debug/examples/mistral --prompt 'Here is a sample quick sort implementation in rust ' -n 400`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
retrieved the files in 145.215µs
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16

Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1962:27
      mistral::main
             at ./candle-examples/examples/mistral/main.rs:253:21
   1: core::ops::function::FnOnce::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:154:18
   2: std::rt::lang_start::{{closure}}
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/rt.rs:166:18
   3: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:284:13
      std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/rt.rs:148:48
      std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/rt.rs:148:20
   4: main
   5: __libc_start_main
             at /build/glibc-BHL3KM/glibc-2.31/csu/../csu/libc-start.c:308:16
   6: _start

For some reason rust-gdb doesn't show me much in this case so I used the RUST_BACKTRACE variable.

Is the problem with my CUDA installation?

LaurentMazare commented 1 year ago

This last issue boils down to the following part of your error message:

Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16

This is likely because your GPU is not recent enough to support bf16, and mistral is a bf16 model.

dashdeckers commented 1 year ago

I tried instead running the Falcon model with the --use-f32 flag but got the same error:

$ cargo run --example falcon --features cuda --release -- --prompt "Here is a sample quick sort implementation in rust " --use-f32
    Finished release [optimized] target(s) in 0.18s
     Running `/home/travis/candle/target/release/examples/falcon --prompt 'Here is a sample quick sort implementation in rust ' --use-f32`
retrieved the files in 142.123µs
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_bf16_f32

The T5 model with --features cuda does run on my GPU, however, so that's great! As far as I'm concerned, the sanity test passed.

jonatino commented 11 months ago

error: asm operand type size(8) does not match type/size implied by constraint 'r'

Same here with an RTX 4090. Just thousands of errors on Windows 11.

shenshouer commented 11 months ago

Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16, on Windows 11 WSL2 (Ubuntu 22.04).

sope@DESKTOP-HNB502N:/mnt/c/Users/Administrator$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
7.5
sope@DESKTOP-HNB502N:/mnt/c/Users/Administrator$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2080 SUPER, 7.5, 546.12
sope@DESKTOP-HNB502N:/mnt/c/Users/Administrator$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
sope@DESKTOP-HNB502N:/mnt/c/Users/Administrator$ uname -a
Linux DESKTOP-HNB502N 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

sope@DESKTOP-HNB502N:/mnt/c/Users/Administrator/workspaces/rust/candle$ cargo r -p candle-examples --example yi --features cuda -- --prompt 怎么学习rust语言
    Finished dev [unoptimized + debuginfo] target(s) in 4.30s
     Running `target/debug/examples/yi --prompt '怎么学习rust语言'`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
retrieved the files in 59.952703ms
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16
LaurentMazare commented 11 months ago

Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16, on Windows 11 WSL2 (Ubuntu 22.04).

We require a compute_cap of at least 8.0 to enable bf16 support, but the RTX 2080 only supports compute cap 7.5, so you will need a more recent GPU to run the bf16-based models.
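
A small sketch of what that means for code choosing a weight dtype (DType::BF16 and DType::F32 are real candle types; the helper and the idea of feeding it the value from nvidia-smi --query-gpu=compute_cap are assumptions for illustration):

use candle_core::DType;

// Hypothetical helper: only request bf16 weights when the compute capability
// is 8.0 or higher (Ampere+); older cards such as the RTX 2080 (7.5) fall back
// to f32, so kernels like cast_f32_bf16 are never needed.
fn weight_dtype(compute_cap: f32) -> DType {
    if compute_cap >= 8.0 { DType::BF16 } else { DType::F32 }
}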

milewski commented 10 months ago

None of the examples I tried work for me:

> nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

name, compute_cap, driver_version
NVIDIA GeForce RTX 2080 with Max-Q Design, 7.5, 546.12

> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:51:05_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

I have tried to run it on Windows and WSL2, and this is the error I get:

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading cast_f16_f32

And some other examples throw this error:

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading is_u32_f32
fax1ty commented 8 months ago
> nvcc --version                                                   
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

> nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv 
name, compute_cap, driver_version
NVIDIA GeForce RTX 2080, 7.5, 551.23

Microsoft Windows [Version 10.0.19045.4046]

cargo:rustc-env=CUDA_COMPUTE_CAP=75
cargo:info=Builder { cuda_root: Some("C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1"), kernel_paths: ["src\affine.cu", "src\binary.cu", "src\cast.cu", "src\conv.cu", "src\fill.cu", "src\indexing.cu", "src\reduce.cu", "src\ternary.cu", "src\unary.cu"], watch: [], include_paths: ["src\binary_op_macros.cuh", "src\compatibility.cuh", "src\cuda_utils.cuh"], compute_cap: Some(75), out_dir: "D:\Desktop\generative\src-tauri\target\debug\build\candle-kernels-2aac372284d54b6a\out", extra_args: [] }
cargo:rustc-env=CUDA_INCLUDE_DIR=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include

thread 'main' panicked at C:\Users\fax1t.cargo\registry\src\index.crates.io-6f17d22bba15001f\bindgen_cuda-0.1.4\src\lib.rs:389:13: nvcc error while compiling "src\affine.cu":


oiwn commented 8 months ago

Stuck with the same issue trying to compile for the NVIDIA Jetson Nano. While it has 4 GB of RAM and is quite capable of running 7B quantized models, this device is barely supported by anyone else, due to CUDA issues and the old version of nvcc shipped with JetPack. NVIDIA has horrible support for its own tech.

oiwn commented 8 months ago

It fails to build with:

error: failed to run custom build command for `candle-kernels v0.4.0 (/home/oiwn/code/candle/candle-kernels)`

Caused by:
  process didn't exit successfully: `/home/oiwn/code/candle/target/release/build/candle-kernels-b78e0c7d3d2aa31c/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=53

  --- stderr
  thread 'main' panicked at /home/oiwn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.4/src/lib.rs:519:43:
  no gpu codes parsed from nvcc
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Which is strange: the candle-kernels dependency is "bindgen_cuda = 0.1.1", which fails on this line: https://github.com/Narsil/bindgen_cuda/blob/main/src/lib.rs#L519
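
For reference, the failing step amounts to something like this sketch (not bindgen_cuda's actual code): run nvcc --list-gpu-code and collect the sm_XX entries; if the old JetPack nvcc doesn't print any, the list comes back empty and the build script panics with "no gpu codes parsed from nvcc".

use std::process::Command;

// Sketch only: parse the sm_XX lines printed by `nvcc --list-gpu-code`.
// An nvcc that doesn't support the flag, or prints a different format,
// yields an empty list here.
fn supported_gpu_codes() -> Vec<u32> {
    let out = Command::new("nvcc")
        .arg("--list-gpu-code")
        .output()
        .expect("failed to run nvcc");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter_map(|line| line.trim().strip_prefix("sm_"))
        .filter_map(|n| n.parse().ok())
        .collect()
}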

Tyler-Hardin commented 8 months ago

I'm hitting this issue with an L4...

tyler@srv:~/candle$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.9
tyler@srv:~/candle$ CUDA_COMPUTE_CAP=89 cargo run --example mistral --features cudnn --release -- --prompt "Here is a sample quick sort implementation in rust " -n 400
    Finished release [optimized] target(s) in 0.22s
     Running `target/release/examples/mistral --prompt 'Here is a sample quick sort implementation in rust ' -n 400`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
retrieved the files in 20.977633ms
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16
tbogdala commented 8 months ago

thread 'main' panicked at C:\Users\fax1t.cargo\registry\src\index.crates.io-6f17d22bba15001f\bindgen_cuda-0.1.4\src\lib.rs:389:13: nvcc error while compiling "src\affine.cu":

I was getting these errors as well when using the 'Developer Command Prompt for VS 2022'.

Switching to the 'x64 Native Tools Command Prompt' solved the problem. Examples and my projects compile with the cuda feature now and work appropriately.

nklsla commented 7 months ago

I have tried to run it on Windows and WSL2, and this is the error I get:

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading cast_f16_f32

And some other examples throw this error:

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading is_u32_f32

Experiencing the same issue building with CUDA_COMPUTE_CAP=86.

CUDA_COMPUTE_CAP=75 gives this compiler error:
error: failed to run custom build command for `candle-kernels v0.4.1` Caused by: process didn't exit successfully: `/home//projects/rs-ml-lab/target/debug/build/candle-kernels-6070d4e13fa3aae8/build-script-build` (exit status: 101) --- stdout cargo:rerun-if-changed=build.rs cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"] cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP cargo:rustc-env=CUDA_COMPUTE_CAP=75 cargo:info=Builder { cuda_root: Some("/usr"), kernel_paths: ["src/affine.cu", "src/binary.cu", "src/cast.cu", "src/conv.cu", "src/fill.cu", "src/indexing.cu", "src/quantized.cu", "src/reduce.cu", "src/ternary.cu", "src/unary.cu"], watch: [], include_paths: ["src/binary_op_macros.cuh", "src/compatibility.cuh", "src/cuda_utils.cuh"], compute_cap: Some(75), out_dir: "/home/niklas/projects/rs-ml-lab/target/debug/build/candle-kernels-310cb726e1654680/out", extra_args: [] } cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/include cargo:rerun-if-changed=src/binary_op_macros.cuh cargo:rerun-if-changed=src/compatibility.cuh cargo:rerun-if-changed=src/cuda_utils.cuh cargo:rerun-if-env-changed=NVCC_CCBIN cargo:rerun-if-changed=src/affine.cu cargo:rerun-if-changed=src/indexing.cu cargo:rerun-if-changed=src/reduce.cu cargo:rerun-if-changed=src/cast.cu cargo:rerun-if-changed=src/quantized.cu cargo:rerun-if-changed=src/ternary.cu cargo:rerun-if-changed=src/unary.cu cargo:rerun-if-changed=src/conv.cu cargo:rerun-if-changed=src/binary.cu cargo:rerun-if-changed=src/fill.cu --- stderr src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" 
operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa 6 errors detected in the compilation of "src/ternary.cu". 6 errors detected in the compilation of "src/reduce.cu". 6 errors detected in the compilation of "src/affine.cu". 6 errors detected in the compilation of "src/indexing.cu". 6 errors detected in the compilation of "src/cast.cu". src/quantized.cu(261): warning #181-D: argument is incompatible with corresponding format string conversion detected during instantiation of "float vec_dot_q4_0_q8_1_impl(const int *, const int *, const float &, const half2 &) [with vdr=4]" (282): here src/quantized.cu(261): warning #181-D: argument is incompatible with corresponding format string conversion detected during instantiation of "float vec_dot_q4_0_q8_1_impl(const int *, const int *, const float &, const half2 &) [with vdr=4]" (282): here src/quantized.cu(26): warning #177-D: function "get_int_from_int8" was declared but never referenced src/quantized.cu(50): warning #177-D: function "get_int_from_uint8_aligned" was declared but never referenced src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" 
operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: identifier "__hmax" is undefined src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: identifier "__hmin" is undefined src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa src/compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "", and vice versa 6 errors detected in the compilation of "src/unary.cu". 6 errors detected in the compilation of "src/binary.cu". 6 errors detected in the compilation of "src/conv.cu". thread 'main' panicked at /home/niklas/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.4/src/lib.rs:389:13: nvcc error while compiling "src/affine.cu": # stdout # stderr note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I tried to avoid type casting but ended up with

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading affine_f32

My env: Ubuntu 22.04 server

$ uname -a
Linux nl 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv 
name, compute_cap, driver_version
Quadro RTX 4000, 7.5, 545.29.06

I'm trying to compile the ViT model. I've tried both --release and normal debug compile mode.

kulame commented 7 months ago

Met the same error on Windows 11.

PS C:\>  nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
NVIDIA GeForce RTX 2080 Ti, 7.5, 551.86

PS C:\> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:30:10_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
evgenyigumnov commented 7 months ago

I have the same problem.

root@C.10515727:~/ai-server$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.17s
     Running `target/debug/ai-server`
retrieved the files in 16.361172ms
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16
root@C.10515727:~/ai-server$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
NVIDIA GeForce RTX 2080 Ti, 7.5, 535.161.07
root@C.10515727:~/ai-server$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
evgenyigumnov commented 7 months ago

root@C.10515727:~/ai-server$ uname -a
Linux fb0f7633e4cb 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
maulberto3 commented 4 months ago

Yes Windows seems to be having issues. I've been told in discord WSL is ok.

In my case (WSL2), although CUDA is correctly installed (as verified by other crates through env vars and all), candle works well WITHOUT CUDA but not WITH CUDA. My error says thread 'main' panicked at /home/mau/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/lib.rs:98:5: : Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. But the env vars are correctly set up...

briancampo commented 3 months ago

In my case (WSL2), although CUDA is correctly installed (as verified by other crates through env vars and all), candle works well WITHOUT CUDA but not WITH CUDA. My error says thread 'main' panicked at /home/mau/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/lib.rs:98:5: : Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. But the env vars are correctly set up...

Was having this issue, but was able to resolve it with a couple of things:

If that doesn't help, reach out and I can share my install script, which seems to be repeatable for me at least. I tear down my WSL distro every couple of months and it has stayed pretty repeatable.