huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

'ptxas' died due to signal 11 (Invalid memory reference) #322

Open Semihal opened 4 months ago

Semihal commented 4 months ago

System Info

Version: v1.4.0
Cargo version: cargo 1.79.0 (ffa9cf99a 2024-06-03)
GCC version: 11.4.1
GPU: compiling with CUDA_COMPUTE_CAP=86 on a machine without a GPU (but with CUDA 12.1 installed). I plan to run the resulting container on an A40, but I don't have a GPU available on the build machine.
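
Since ptxas itself is the process that crashes, it can help to first confirm exactly which CUDA toolchain and host compiler the build environment is using. A minimal check, assuming the CUDA 12.1 install path from the system info above (a diagnostic sketch, not part of the original report):

# Toolchain sanity check inside the build environment
export PATH=${PATH}:/usr/local/cuda-12.1/bin
which nvcc ptxas        # confirm both resolve to the CUDA 12.1 toolkit
nvcc --version          # release and build number of the CUDA compiler driver
ptxas --version         # the PTX assembler that is segfaulting
gcc --version           # host compiler used for the C++ preprocessing steps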

Information

Tasks

Reproduction

I run this script:

export CUDA_COMPUTE_CAP=86
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=${PATH}:/usr/local/cuda-12.1/bin
# Limit parallelism
export CARGO_BUILD_JOBS=1
export RAYON_NUM_THREADS=1
export CARGO_BUILD_INCREMENTAL=true

cd /usr/src/text-embeddings-inference || true

nvprune \
  --generate-code code=sm_80 \
  --generate-code code=sm_${CUDA_COMPUTE_CAP} \
  /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a

cargo chef cook --release \
  --features candle-cuda \
  --features static-linking \
  --no-default-features \
  --recipe-path recipe.json && \
   sccache -s

I get this error:

[18:29:50] :     [Step 1/2] error: failed to run custom build command for `candle-flash-attn v0.5.0 (https://github.com/OlivierDehaene/candle?rev=33b7ecf9ed82bb7c20f1a94555218fabfbaa2fe3#33b7ecf9)`
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2] Caused by:
[18:29:50] :     [Step 1/2]   process didn't exit successfully: `/usr/src/text-embeddings-inference/target/release/build/candle-flash-attn-67bc68aa050514c7/build-script-build` (exit status: 101)
[18:29:50] :     [Step 1/2]   --- stdout
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=build.rs
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_api.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim128_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim160_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim192_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim224_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim256_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim32_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim64_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim96_fp16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim128_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim160_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim192_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim224_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim256_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim32_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim64_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_hdim96_bf16_sm80.cu
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_kernel.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash_fwd_launch_template.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/flash.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/philox.cuh
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/softmax.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/utils.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/kernel_traits.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/block_info.h
[18:29:50] :     [Step 1/2]   cargo:rerun-if-changed=kernels/static_switch.h
[18:29:50] :     [Step 1/2]   cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
[18:29:50] :     [Step 1/2]   cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
[18:29:50] :     [Step 1/2]   cargo:rustc-env=CUDA_COMPUTE_CAP=86
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2]   --- stderr

[....]

[18:29:50] :     [Step 1/2]   #$ CUDAFE_FLAGS=
[18:29:50] :     [Step 1/2]   #$ PTXAS_FLAGS=
[18:29:50] :     [Step 1/2]   #$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=860 -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__ -D__CUDACC_RELAXED_CONSTEXPR__  -O3 -I"cutlass/include" "-I/usr/local/cuda-12.1/bin/../targets/x86_64-linux/include"    -U "__CUDA_NO_HALF_OPERATORS__" -U "__CUDA_NO_HALF_CONVERSIONS__" -U "__CUDA_NO_HALF2_OPERATORS__" -U "__CUDA_NO_BFLOAT16_CONVERSIONS__" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=1 -D__CUDACC_VER_BUILD__=105 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=1 -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "kernels/flash_fwd_hdim32_bf16_sm80.cu" -o "/tmp/tmpxft_000017c2_00000000-5_flash_fwd_hdim32_bf16_sm80.cpp4.ii" 
[18:29:50] :     [Step 1/2]   #$ cudafe++ --c++17 --gnu_version=110401 --display_error_number --orig_src_file_name "kernels/flash_fwd_hdim32_bf16_sm80.cu" --orig_src_path_name "/root/.cargo/git/checkouts/candle-2c6db576e0f06e81/33b7ecf/candle-flash-attn/kernels/flash_fwd_hdim32_bf16_sm80.cu" --allow_managed --extended-lambda --relaxed_constexpr  --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.cudafe1.cpp" --stub_file_name "tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_000017c2_00000000-4_flash_fwd_hdim32_bf16_sm80.module_id" "/tmp/tmpxft_000017c2_00000000-5_flash_fwd_hdim32_bf16_sm80.cpp4.ii" 
[18:29:50] :     [Step 1/2]   #$ gcc -std=c++17 -D__CUDA_ARCH__=860 -D__CUDA_ARCH_LIST__=860 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__ -D__CUDACC_RELAXED_CONSTEXPR__  -O3 -I"cutlass/include" "-I/usr/local/cuda-12.1/bin/../targets/x86_64-linux/include"    -U "__CUDA_NO_HALF_OPERATORS__" -U "__CUDA_NO_HALF_CONVERSIONS__" -U "__CUDA_NO_HALF2_OPERATORS__" -U "__CUDA_NO_BFLOAT16_CONVERSIONS__" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=1 -D__CUDACC_VER_BUILD__=105 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=1 -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "kernels/flash_fwd_hdim32_bf16_sm80.cu" -o "/tmp/tmpxft_000017c2_00000000-7_flash_fwd_hdim32_bf16_sm80.cpp1.ii" 
[18:29:50] :     [Step 1/2]   #$ cicc --c++17 --gnu_version=110401 --display_error_number --orig_src_file_name "kernels/flash_fwd_hdim32_bf16_sm80.cu" --orig_src_path_name "/root/.cargo/git/checkouts/candle-2c6db576e0f06e81/33b7ecf/candle-flash-attn/kernels/flash_fwd_hdim32_bf16_sm80.cu" --allow_managed --extended-lambda --relaxed_constexpr   -arch compute_86 -m64 --no-version-ident -ftz=1 -prec_div=0 -prec_sqrt=0 -fmad=1 -fast-math --gen_div_approx_ftz --include_file_name "tmpxft_000017c2_00000000-3_flash_fwd_hdim32_bf16_sm80.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_000017c2_00000000-4_flash_fwd_hdim32_bf16_sm80.module_id" --gen_c_file_name "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.cudafe1.c" --stub_file_name "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.cudafe1.gpu"  "/tmp/tmpxft_000017c2_00000000-7_flash_fwd_hdim32_bf16_sm80.cpp1.ii" -o "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.ptx"
[18:29:50] :     [Step 1/2]   #$ ptxas -arch=sm_86 -m64  "/tmp/tmpxft_000017c2_00000000-6_flash_fwd_hdim32_bf16_sm80.ptx"  -o "/tmp/tmpxft_000017c2_00000000-8_flash_fwd_hdim32_bf16_sm80.sm_86.cubin" 
[18:29:50] :     [Step 1/2]   nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
[18:29:50] :     [Step 1/2]   nvcc error   : 'ptxas' core dumped
[18:29:50] :     [Step 1/2]   # --error 0x8b --
[18:29:50] :     [Step 1/2]   thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.5/src/lib.rs:262:21:
[18:29:50] :     [Step 1/2]   nvcc error while executing compiling: "nvcc" "--gpu-architecture=sm_86" "-c" "-o" "/usr/src/text-embeddings-inference/target/release/build/candle-flash-attn-6656f6d321f9dddf/out/flash_fwd_hdim32_bf16_sm80-aca7d8fdce93ef53.o" "--default-stream" "per-thread" "-std=c++17" "-O3" "-U__CUDA_NO_HALF_OPERATORS__" "-U__CUDA_NO_HALF_CONVERSIONS__" "-U__CUDA_NO_HALF2_OPERATORS__" "-U__CUDA_NO_BFLOAT16_CONVERSIONS__" "-Icutlass/include" "--expt-relaxed-constexpr" "--expt-extended-lambda" "--use_fast_math" "--verbose" "kernels/flash_fwd_hdim32_bf16_sm80.cu"
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2]   # stdout
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2]   # stderr
[18:29:50] :     [Step 1/2] 
[18:29:50] :     [Step 1/2]   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[18:29:51] :     [Step 1/2] thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cargo-chef-0.1.67/src/recipe.rs:218:27:
[18:29:51] :     [Step 1/2] Exited with status code: 101
[18:29:51] :     [Step 1/2] note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[18:29:59]W:     [Step 1/2] The command '/bin/sh -c docker/build' returned a non-zero code: 101
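
To confirm that the crash comes from ptxas itself rather than from cargo chef or the candle build script, the failing nvcc invocation from the log can be re-run by hand. A sketch, with the working directory and all flags copied from the log above (only the output path is changed to /tmp):

# Re-run the failing compilation in isolation; the checkout path is taken
# from the cudafe++ line in the log above
cd /root/.cargo/git/checkouts/candle-2c6db576e0f06e81/33b7ecf/candle-flash-attn
nvcc --gpu-architecture=sm_86 -c \
  -o /tmp/flash_fwd_hdim32_bf16_sm80.o \
  --default-stream per-thread -std=c++17 -O3 \
  -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ \
  -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ \
  -Icutlass/include --expt-relaxed-constexpr --expt-extended-lambda \
  --use_fast_math --verbose \
  kernels/flash_fwd_hdim32_bf16_sm80.cu
# If this reproduces "'ptxas' died due to signal 11", the crash is in the
# CUDA 12.1 ptxas binary itself, independent of the Rust build.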

Expected behavior

TEI compiles successfully.

OlivierDehaene commented 4 months ago

I plan to use this container

I'm confused: do you want a container or a binary? If you want a container, why not use the official one or the official command?
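
For reference, the project's documented way to build a CUDA image for a specific GPU is to build the repository's Dockerfile with the target compute capability as a build argument. A sketch based on the local-build instructions in the README; the Dockerfile-cuda file name and the CUDA_COMPUTE_CAP build arg are assumed from those docs:

# Build the official CUDA image targeting an A40 (compute capability 8.6)
git clone https://github.com/huggingface/text-embeddings-inference
cd text-embeddings-inference
docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=86 -t tei-cuda-86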

Semihal commented 4 months ago

I'm confused, do you want a container or a binary?

I want to install TEI in a container image for future use.

If you want a container why not use the official one or the official command?

These are the instructions from the official Dockerfile.

Semihal commented 4 months ago

For clarity, the code I execute looks exactly like this (taken from the official Docker image build):

export CUDA_COMPUTE_CAP=86
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=${PATH}:/usr/local/cuda-12.1/bin
# Limit parallelism
export CARGO_BUILD_JOBS=1
export RAYON_NUM_THREADS=1
export CARGO_BUILD_INCREMENTAL=true

if [ ${CUDA_COMPUTE_CAP} -ge 75 -a ${CUDA_COMPUTE_CAP} -lt 80 ];
then
    nvprune \
      --generate-code code=sm_${CUDA_COMPUTE_CAP} \
      /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a;
elif [ ${CUDA_COMPUTE_CAP} -ge 80 -a ${CUDA_COMPUTE_CAP} -lt 90 ];
then
    nvprune \
      --generate-code code=sm_80 \
      --generate-code code=sm_${CUDA_COMPUTE_CAP} \
      /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a;
elif [ ${CUDA_COMPUTE_CAP} -eq 90 ];
then
    nvprune \
      --generate-code code=sm_90 \
      /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a;
else
    echo "cuda compute cap ${CUDA_COMPUTE_CAP} is not supported"; exit 1;
fi;

if [ ${CUDA_COMPUTE_CAP} -ge 75 -a ${CUDA_COMPUTE_CAP} -lt 80 ];
then
    cargo chef cook --release \
      --features candle-cuda-turing \
      --features static-linking \
      --no-default-features \
      --recipe-path recipe.json && \
      sccache -s;
else
    cargo chef cook --release \
      --features candle-cuda \
      --features static-linking \
      --no-default-features \
      --recipe-path recipe.json && \
      sccache -s;
fi;