mlc-ai / llm-perf-bench

Apache License 2.0

CUDA error: no kernel image is available for execution on the device #3

Closed. sleepwalker2017 closed this issue 11 months ago.

sleepwalker2017 commented 11 months ago

Hi, I used the commands in the README to run this project.

Since the README doesn't specify a model repo, I downloaded the model from here: https://huggingface.co/daryl149/llama-2-7b-chat-hf

The build process is fine, but when I run the built model, it fails with:

Use MLC config: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json"
Use model weights: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json"
Use model library: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/llama-2-7b-chat-hf-q4f16_1-cuda.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [local_id]  reload model `local_id` from disk, or reload the current model if `local_id` is not specified

Loading model...
Loading finished
Running system prompts...
CUDA error: no kernel image is available for execution on the device
Aborted
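
A quick way to see what compute capability the runtime actually detects, to compare against the -arch the library was compiled for, is a minimal check with TVM's device API (assuming the same Python environment used for the build):

# Print whether TVM sees GPU 0 and its compute capability (e.g. "7.0" on V100),
# to compare against the -arch=sm_XX recorded in the build log.
python -c "import tvm; dev = tvm.cuda(0); print(dev.exist, dev.compute_version)"
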
sleepwalker2017 commented 11 months ago

My GPU and CUDA versions are shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   44C    P0    56W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
sleepwalker2017 commented 11 months ago
root@mlc-perf:/mlc-llm# micromamba activate python311
(python311) root@mlc-perf:/mlc-llm# sh model_build.sh /models/llama-2-7b-chat-hf
Using path "/models/llama-2-7b-chat-hf" for model "llama-2-7b-chat-hf"
Database paths: ['log_db/vicuna-v1-7b', 'log_db/redpajama-3b-q4f16', 'log_db/rwkv-raven-1b5', 'log_db/redpajama-3b-q4f32', 'log_db/rwkv-raven-7b', 'log_db/rwkv-raven-3b', 'log_db/dolly-v2-3b']
Target configured: cuda -keys=cuda,gpu -arch=sm_70 -max_num_threads=1024 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_70 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.5313796997070312 GB
Start storing to cache ./dist/llama-2-7b-chat-hf-q4f16_1/params
[0327/0327] saving param_326
All finished, 115 total shards committed, record saved to ./dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to ./dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json
[07:14:28] /mlc-llm/3rdparty/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
[07:14:29] /mlc-llm/3rdparty/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
Save a cached module to ./dist/llama-2-7b-chat-hf-q4f16_1/mod_cache_before_build.pkl.
Finish exporting to ./dist/llama-2-7b-chat-hf-q4f16_1/llama-2-7b-chat-hf-q4f16_1-cuda.so

This is the build log.

junrushao commented 11 months ago

Your CUDA version is 11.6, which is pretty old. It's recommended to use 11.8 or 12.1 if you want to give it a shot.

BTW, would you mind trying out the Dockerfile? It's based on 12.1 and should work out of the box.
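
For reference, a typical build-and-run sequence looks roughly like the sketch below; the image tag and Dockerfile path are illustrative placeholders, and the README has the exact commands.

# Illustrative only -- substitute the repo's actual Dockerfile path and image tag.
docker build -t llm-perf-bench -f Dockerfile .
docker run --gpus all -it --rm llm-perf-bench /bin/bash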

sleepwalker2017 commented 11 months ago

Your CUDA version is 11.6, which is pretty old. It's recommended to use 11.8 or 12.1 if you want to give it a shot.

BTW, would you mind trying out the Dockerfile? It's based on 12.1 and should work out of the box.

I'm sorry, I gave the wrong nvidia-smi output earlier.

I ran all the commands in your README, including building the Docker image, so the error occurs with CUDA 12.1.

Inside the Docker container, here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   44C    P0    56W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
junrushao commented 11 months ago

We haven't used V100 for a while, so there could be something we missed or hardcoded. In particular, you may want to change this line from 80 to 70 as well: https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/core.py#L356
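
As a hypothetical illustration of the kind of one-line change meant here (verify what line 356 actually contains before editing):

# Edit the hardcoded value on line 356 of core.py from 80 to 70 to match the V100.
# The line number is only valid for the commit current at the time of this comment.
sed -i '356s/80/70/' /mlc-llm/mlc_llm/core.py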

junrushao commented 11 months ago

Any updates?

sleepwalker2017 commented 11 months ago

Any updates?

Hi, I'm in China, so there's a time difference. I'll check it out today.

sleepwalker2017 commented 11 months ago
Start storing to cache ./dist/llama-2-7b-chat-hf-q4f16_1/params
[0327/0327] saving param_326
All finished, 115 total shards committed, record saved to ./dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to ./dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json
Traceback (most recent call last):
  File "/mlc-llm/build.py", line 4, in <module>
    main()
  File "/mlc-llm/mlc_llm/build.py", line 10, in main
    core.build_model_from_args(parsed_args)
  File "/mlc-llm/mlc_llm/core.py", line 447, in build_model_from_args
    mod = mod_transform_before_build(mod, param_manager, args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/mlc_llm/core.py", line 305, in mod_transform_before_build
    mod = relax.transform.RunCodegen(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/3rdparty/tvm/python/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 238, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  6: 0x0000560ba2c3c35c
  5: __libc_start_main
  4: 0x00007f247ef2fd8f
  3: unpack_call<tvm::IRModule, 2, tvm::transform::RunCodegen(tvm::runtime::Optional<tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef> > >, tvm::runtime::Array<tvm::runtime::String>)::<lambda(tvm::IRModule, tvm::transform::PassContext)> >
        at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:187
  2: tvm::relax::CodeGenRunner::Run(tvm::runtime::Optional<tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, void, void> >, tvm::runtime::Array<tvm::runtime::String, void>)
        at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:50
  1: tvm::relax::CodeGenRunner::InvokeCodegen(tvm::IRModule, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, void, void>)
        at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:167
  0: tvm::relax::contrib::CUTLASSCompiler(tvm::runtime::Array<tvm::relax::Function, void>, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, tvm::runtime::Map<tvm::relax::Constant, tvm::runtime::String, void, void>)
        at /mlc-llm/3rdparty/tvm/src/relax/backend/contrib/cutlass/codegen.cc:284
  File "/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 82, in cfun
    rv = local_pyfunc(*pyargs)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/build.py", line 980, in profile_relax_function
    conv2d_profiler = CutlassConv2DProfiler(sm, _get_cutlass_path(), tmp_dir)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/gen_conv2d.py", line 186, in __init__
    self.gemm_profiler = CutlassGemmProfiler(sm, cutlass_path, binary_path)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/gen_gemm.py", line 197, in __init__
    assert sm in GENERATOR_FUNC_TABLE and sm in DEFAULT_KERNELS, f"sm{sm} not supported yet."
                                          ^^^^^^^^^^^^^^^^^^^^^
AssertionError: sm70 not supported yet.

It seems sm_70 is not supported. Why is that? @junrushao
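
For reference, the set of sm values this TVM build's CUTLASS profiler accepts can be listed directly from the module that raises the assertion (a quick check in the same environment):

# Print the sm values known to the CUTLASS GEMM profiler; the assertion above
# fires because 70 is missing from at least one of these tables.
python -c "from tvm.contrib.cutlass.gen_gemm import DEFAULT_KERNELS, GENERATOR_FUNC_TABLE; print(sorted(DEFAULT_KERNELS), sorted(GENERATOR_FUNC_TABLE))"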

junrushao commented 11 months ago

Use sm_75; they are the same anyway.

sleepwalker2017 commented 11 months ago

OK, I'll try that.

BTW, I ran the repo's Docker build on a T4 machine and got this error:

Step 3/10 : RUN grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc                                 &&     mv /tmp/bashrc ~/.bashrc                                                                  &&     echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc                                              &&     echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc                                &&     ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1       &&     apt update                                                                                &&     apt install --yes wget curl git vim build-essential openssh-server
 ---> Running in e723bd0574b8
ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libcuda.so.1': File exists
The command '/bin/bash -ec grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc                                 &&     mv /tmp/bashrc ~/.bashrc                                                                  &&     echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc                                              &&     echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc                                &&     ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1       &&     apt update                                                                                &&     apt install --yes wget curl git vim build-essential openssh-server' returned a non-zero code: 1

Here is my T4:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:12:00.0 Off |                    0 |
| N/A   65C    P8    18W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:13:00.0 Off |                    0 |
| N/A   71C    P8    19W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |

@junrushao

sleepwalker2017 commented 11 months ago

Use sm_75; they are the same anyway.

Building with sm_75 compiles OK, but I still get "no kernel image is available".

The V100's compute capability is 7.0, so it seems the sm_75 kernels can't be loaded on it.

Is there a new feature in TVM that causes this limitation?
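
One way to double-check the card's compute capability against the sm value used for the build (a small sketch; the compute_cap query field requires a reasonably recent driver):

# Report each GPU's compute capability: 7.0 for V100, 7.5 for T4.
# A library built only for sm_75 will not load on a 7.0 device.
nvidia-smi --query-gpu=name,compute_cap --format=csv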

junrushao commented 11 months ago
Step 3/10 : RUN grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc                                 &&     mv /tmp/bashrc ~/.bashrc                                                                  &&     echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc                                              &&     echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc                                &&     ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1       &&     apt update                                                                                &&     apt install --yes wget curl git vim build-essential openssh-server
 ---> Running in e723bd0574b8
ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libcuda.so.1': File exists

Could you please remove the line containing ln -s?
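
That is, keep the RUN step but drop the symlink (or switch it to ln -sf so it overwrites the existing /usr/lib/x86_64-linux-gnu/libcuda.so.1). A sketch of the trimmed step:

# Same RUN step as in the Dockerfile, with the ln -s line removed.
RUN grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc && \
    mv /tmp/bashrc ~/.bashrc && \
    echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc && \
    echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc && \
    apt update && \
    apt install --yes wget curl git vim build-essential openssh-server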

junrushao commented 11 months ago

Building with sm_75 compiles OK, but I still get "no kernel image is available".

If you have everything properly configured, then the only remaining possibility is that CUTLASS doesn't support sm_70 well, and the only thing you can do is disable it, which will inevitably lead to a mild performance loss:

python build.py \
  --model ~/models/Llama-2/hf/Llama-2-13b-chat-hf \
  --target cuda \
  --no-cutlass-attn --no-cutlass-norm \
  --quantization q4f16_1 \
  --artifact-path "./dist" \
  --use-cache 0

But on the other hand, I feel strongly that you really need to check your CUDA setup, as I've noticed some inconsistencies: you have CUDA 12.0 on the T4, 11.6 outside Docker, and 12.1 inside Docker. I'm not sure which TVM version you installed (the default image uses 12.1). This is a red flag to me. You may want to have the same CUDA version everywhere, and please make sure it's either 11.8 or 12.1.
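
A quick way to cross-check the versions in play, run both on the host and inside the container (the grep pattern is just a convenience and may need adjusting to your package names):

# Driver-reported CUDA version (the banner at the top of nvidia-smi).
nvidia-smi | head -n 4
# Toolkit version used to compile the model library.
nvcc --version
# CUDA-related Python packages in the active environment.
pip list 2>/dev/null | grep -i -E 'tvm|mlc|cuda'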

sleepwalker2017 commented 11 months ago

Another error:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [456 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [979 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.0 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [848 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [833 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:14 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:15 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [864 kB]
Get:16 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [49.8 kB]
Get:17 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1235 kB]
Get:18 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1103 kB]
Get:19 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [49.2 kB]
Get:20 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [25.6 kB]
Fetched 26.8 MB in 4s (5959 kB/s)
Reading package lists...
E: Problem executing scripts APT::Update::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
E: Sub-process returned an error code
The command '/bin/bash -ec grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc                                 &&     mv /tmp/bashrc ~/.bashrc                                                                  &&     echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc                                              &&     echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc                                &&     apt update                                                                                &&     apt install --yes wget curl git vim build-essential openssh-server' returned a non-zero code: 100

I tried a Dockerfile from another project on my machine, and it runs OK. Weird.

sleepwalker2017 commented 11 months ago

CUDA 12.0 on T4,

Maybe I didn't express it clearly: the V100 and the T4 are on two separate machines.

On the T4 machine, the CUDA version is 12.0.
On the V100 machine, it's 11.6.
Both readings are from the bare machines, not from inside Docker.

junrushao commented 11 months ago

CUDA 12.0 on T4,

Maybe I didn't express it clearly: the V100 and the T4 are on two separate machines.

On the T4 machine, the CUDA version is 12.0. On the V100 machine, it's 11.6. Both readings are from the bare machines, not from inside Docker.

I think you get the idea: all CUDA versions need to be the same, including the pip packages.

junrushao commented 11 months ago

I tried a Dockerfile from another project on my machine, and it runs OK. Weird.

It's a network issue. You are probably blocked.

junrushao commented 11 months ago

Closing as this issue seems stale for a while. Please feel free to open a new one if the problem persists.