Closed: sleepwalker2017 closed this issue 11 months ago.
My GPU and CUDA versions are as below:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:18:00.0 Off | 0 |
| N/A 44C P0 56W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
root@mlc-perf:/mlc-llm# micromamba activate python311
(python311) root@mlc-perf:/mlc-llm# sh model_build.sh /models/llama-2-7b-chat-hf
Using path "/models/llama-2-7b-chat-hf" for model "llama-2-7b-chat-hf"
Database paths: ['log_db/vicuna-v1-7b', 'log_db/redpajama-3b-q4f16', 'log_db/rwkv-raven-1b5', 'log_db/redpajama-3b-q4f32', 'log_db/rwkv-raven-7b', 'log_db/rwkv-raven-3b', 'log_db/dolly-v2-3b']
Target configured: cuda -keys=cuda,gpu -arch=sm_70 -max_num_threads=1024 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_70 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.5313796997070312 GB
Start storing to cache ./dist/llama-2-7b-chat-hf-q4f16_1/params
[0327/0327] saving param_326
All finished, 115 total shards committed, record saved to ./dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to ./dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json
[07:14:28] /mlc-llm/3rdparty/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
[07:14:29] /mlc-llm/3rdparty/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
Save a cached module to ./dist/llama-2-7b-chat-hf-q4f16_1/mod_cache_before_build.pkl.
Finish exporting to ./dist/llama-2-7b-chat-hf-q4f16_1/llama-2-7b-chat-hf-q4f16_1-cuda.so
That is the full build log.
Your CUDA version is 11.6, which is pretty old. It's recommended to use 11.8 or 12.1 if you want to give it a shot.
BTW, would you mind trying out the Dockerfile? It's based on 12.1 and should work out of the box.
> Your cuda version is 11.6 which is pretty old. It's recommended to use 11.8 or 12.1 if you wanted to give it a shot.
> BTW, would you mind trying out the dockerfile? It's based on 12.1 and should work out of the box
I'm sorry, I gave the wrong nvidia-smi result.
I ran all the commands in your README, including building the Docker image, so the error occurs with CUDA 12.1.
Inside the Docker container, here is the nvidia-smi result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:18:00.0 Off | 0 |
| N/A 44C P0 56W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
We haven't used V100 for a while, so there could be something we missed or hardcoded. In particular, you may want to change this line from 80 to 70 as well: https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/core.py#L356
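For illustration, here is a hedged sketch of what the suggestion amounts to: instead of a hardcoded sm_80 value, derive the architecture number from the TVM target string that the build log already prints. The helper below is hypothetical, not the actual `mlc_llm/core.py` code:

```python
import re

def parse_sm_from_target(target_str, default=80):
    """Extract the compute architecture number (e.g. 70) from a TVM
    target string such as 'cuda -keys=cuda,gpu -arch=sm_70 ...'.
    Falls back to `default` when no -arch flag is present."""
    m = re.search(r"-arch=sm_(\d+)", target_str)
    return int(m.group(1)) if m else default

# For the V100 target shown in the build log, this yields 70,
# not the hardcoded 80:
target = "cuda -keys=cuda,gpu -arch=sm_70 -max_num_threads=1024 -thread_warp_size=32"
print(parse_sm_from_target(target))  # 70
```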
Any updates?
Hi, I'm in China, so there's a time difference. I'll check it out today.
Any updates?
Start storing to cache ./dist/llama-2-7b-chat-hf-q4f16_1/params
[0327/0327] saving param_326
All finished, 115 total shards committed, record saved to ./dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to ./dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json
Traceback (most recent call last):
File "/mlc-llm/build.py", line 4, in <module>
main()
File "/mlc-llm/mlc_llm/build.py", line 10, in main
core.build_model_from_args(parsed_args)
File "/mlc-llm/mlc_llm/core.py", line 447, in build_model_from_args
mod = mod_transform_before_build(mod, param_manager, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mlc-llm/mlc_llm/core.py", line 305, in mod_transform_before_build
mod = relax.transform.RunCodegen(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mlc-llm/3rdparty/tvm/python/tvm/ir/transform.py", line 238, in __call__
return _ffi_transform_api.RunPass(self, mod)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 238, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
6: 0x0000560ba2c3c35c
5: __libc_start_main
4: 0x00007f247ef2fd8f
3: unpack_call<tvm::IRModule, 2, tvm::transform::RunCodegen(tvm::runtime::Optional<tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef> > >, tvm::runtime::Array<tvm::runtime::String>)::<lambda(tvm::IRModule, tvm::transform::PassContext)> >
at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:187
2: tvm::relax::CodeGenRunner::Run(tvm::runtime::Optional<tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, void, void> >, tvm::runtime::Array<tvm::runtime::String, void>)
at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:50
1: tvm::relax::CodeGenRunner::InvokeCodegen(tvm::IRModule, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, void, void>)
at /mlc-llm/3rdparty/tvm/src/relax/transform/run_codegen.cc:167
0: tvm::relax::contrib::CUTLASSCompiler(tvm::runtime::Array<tvm::relax::Function, void>, tvm::runtime::Map<tvm::runtime::String, tvm::runtime::ObjectRef, void, void>, tvm::runtime::Map<tvm::relax::Constant, tvm::runtime::String, void, void>)
at /mlc-llm/3rdparty/tvm/src/relax/backend/contrib/cutlass/codegen.cc:284
File "/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 82, in cfun
rv = local_pyfunc(*pyargs)
File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/build.py", line 980, in profile_relax_function
conv2d_profiler = CutlassConv2DProfiler(sm, _get_cutlass_path(), tmp_dir)
^^^^^^^^^^^^^^^^^^^^^
File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/gen_conv2d.py", line 186, in __init__
self.gemm_profiler = CutlassGemmProfiler(sm, cutlass_path, binary_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mlc-llm/3rdparty/tvm/python/tvm/contrib/cutlass/gen_gemm.py", line 197, in __init__
assert sm in GENERATOR_FUNC_TABLE and sm in DEFAULT_KERNELS, f"sm{sm} not supported yet."
^^^^^^^^^^^^^^^^^^^^^
AssertionError: sm70 not supported yet.
It seems sm_70 is not supported. Why is that? @junrushao
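For reference, the failing assertion in the traceback is a plain table lookup: CUTLASS kernels are only generated for architectures present in both of TVM's lookup tables. A reduced reproduction of the pattern (table contents here are illustrative, not the real entries in `gen_gemm.py`):

```python
# Hypothetical, reduced version of the check in
# tvm/python/tvm/contrib/cutlass/gen_gemm.py: sm_70 (V100) appears
# in neither table, so the assertion fires.
GENERATOR_FUNC_TABLE = {75: "generate_sm75_tensor_op", 80: "generate_sm80_tensor_op"}
DEFAULT_KERNELS = {75: "sm75_defaults", 80: "sm80_defaults"}

def check_sm_supported(sm):
    assert sm in GENERATOR_FUNC_TABLE and sm in DEFAULT_KERNELS, f"sm{sm} not supported yet."

check_sm_supported(75)  # passes silently
try:
    check_sm_supported(70)
except AssertionError as err:
    print(err)  # sm70 not supported yet.
```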
Use sm_75; they are the same anyway.
OK, I'll try that.
BTW, I ran the repo's Docker build on a T4 machine and hit this error:
Step 3/10 : RUN grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc && mv /tmp/bashrc ~/.bashrc && echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc && echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc && ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 && apt update && apt install --yes wget curl git vim build-essential openssh-server
---> Running in e723bd0574b8
ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libcuda.so.1': File exists
The command '/bin/bash -ec grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc && mv /tmp/bashrc ~/.bashrc && echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc && echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc && ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 && apt update && apt install --yes wget curl git vim build-essential openssh-server' returned a non-zero code: 1
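One possible workaround (my assumption, not a maintainer-confirmed fix) is to make the symlink step idempotent with `ln -sf`, which replaces an existing entry instead of failing. A small sandboxed demo of the difference (`/tmp/linkdemo` is just a scratch directory; the source path mirrors the Dockerfile):

```shell
# ln -s fails with "File exists" when the link path already has an
# entry; ln -sf removes the existing entry first.
mkdir -p /tmp/linkdemo
cd /tmp/linkdemo
touch libcuda.so.1   # simulate the pre-existing file from the failing build
ln -sf /usr/local/cuda/lib64/stubs/libcuda.so libcuda.so.1
ls -l libcuda.so.1   # now a symlink, even though the path existed before
```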
Here is my T4:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:12:00.0 Off | 0 |
| N/A 65C P8 18W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:13:00.0 Off | 0 |
| N/A 71C P8 19W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
@junrushao
> Use sm_75 they are the same anyways
SM 75 compiles OK, but I still get a "no kernel image found" error.
The V100's compute capability is 7.0, so it seems the sm_75 binary can't be loaded.
Does some new feature in TVM impose this limit?
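The load failure is consistent with how CUDA binary (SASS) compatibility works, independent of TVM: a cubin built for sm_75 only loads on devices with the same major compute-capability version and a minor version at least as high, so sm_75 code cannot run on a CC 7.0 V100. A minimal sketch of that rule (my summary of the CUDA compatibility model, not mlc-llm code):

```python
def sass_loadable(build_sm, device_cc):
    """Whether SASS compiled for arch `build_sm` (e.g. 75) can load on a
    device with compute capability `device_cc` (e.g. 70). SASS is only
    forward-compatible within a major version: the device must match the
    major digit and be at least the build minor."""
    return build_sm // 10 == device_cc // 10 and device_cc >= build_sm

print(sass_loadable(75, 70))  # False: sm_75 binary on a V100 (CC 7.0)
print(sass_loadable(70, 75))  # True:  sm_70 binary runs on a T4 (CC 7.5)
print(sass_loadable(70, 80))  # False: different major version (e.g. A100)
```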
> Step 3/10 : RUN grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc && mv /tmp/bashrc ~/.bashrc && echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc && echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc && ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 && apt update && apt install --yes wget curl git vim build-essential openssh-server
> ---> Running in e723bd0574b8
> ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libcuda.so.1': File exists
Could you please remove the line containing ln -s?
> SM 75 compiles ok, but still no kernel image found.
If you have everything properly configured, then the only remaining possibility is that CUTLASS doesn't support sm_70 well, and the only thing you can do is disable it, which will inevitably lead to a mild performance loss:
python build.py \
--model ~/models/Llama-2/hf/Llama-2-13b-chat-hf \
--target cuda \
--no-cutlass-attn --no-cutlass-norm \
--quantization q4f16_1 \
--artifact-path "./dist" \
--use-cache 0
But on the other hand, I feel strongly that you really need to check your CUDA setup, as I've noticed some inconsistencies: you have CUDA 12.0 on the T4, 11.6 outside Docker, and 12.1 inside Docker. I'm not sure which TVM version you installed (the default image uses 12.1). This is a red flag to me. You may want to use the same CUDA version everywhere, and please make sure it's either 11.8 or 12.1.
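The kind of sanity check being suggested can be sketched as a simple consistency test over the CUDA versions each component reports. This is a hedged illustration with the version strings from this thread, not an mlc-llm utility:

```python
def cuda_versions_consistent(versions):
    """Check that every component reports the same CUDA major.minor.
    `versions` maps a component name to its reported CUDA version string."""
    return len(set(versions.values())) == 1

# The mismatched setup described in this thread:
print(cuda_versions_consistent({
    "driver (nvidia-smi, host)": "11.6",
    "docker image": "12.1",
}))  # False

# What a clean setup would look like:
print(cuda_versions_consistent({
    "driver (nvidia-smi)": "12.1",
    "docker image": "12.1",
    "pip wheel (tvm/mlc)": "12.1",
}))  # True
```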
Another error:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages [456 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [979 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.0 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [848 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [833 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:14 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:15 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [864 kB]
Get:16 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [49.8 kB]
Get:17 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1235 kB]
Get:18 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1103 kB]
Get:19 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [49.2 kB]
Get:20 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [25.6 kB]
Fetched 26.8 MB in 4s (5959 kB/s)
Reading package lists...
E: Problem executing scripts APT::Update::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
E: Sub-process returned an error code
The command '/bin/bash -ec grep -v '[ -z "\$PS1" ] && return' ~/.bashrc >/tmp/bashrc && mv /tmp/bashrc ~/.bashrc && echo "export MLC_HOME=/mlc_llm/" >>~/.bashrc && echo "export PATH=/usr/local/cuda/bin/:\$PATH" >>~/.bashrc && apt update && apt install --yes wget curl git vim build-essential openssh-server' returned a non-zero code: 100
I tried another project's Dockerfile on my machine and it ran OK. Weird.
> CUDA 12.0 on T4,
Maybe I didn't express it clearly: the V100 and T4 are on two separate machines.
On the T4 machine, the CUDA version is 12.0; on the V100, it's 11.6.
Both readings are from the real machines, not from inside Docker.
> CUDA 12.0 on T4,
> Maybe I didn't express clearly, V100 and T4 are on two separate machines.
> On T4 machine, the CUDA version is 12.0. On V100 it's 11.6. Both are real machine, not inside the docker.
I think you get the idea: all CUDA versions need to be the same, including the pip package's.
> I tried another Dockerfile on my machine, it runs ok. weird.
It's a network issue; you are probably blocked.
Closing as this issue seems stale for a while. Please feel free to open a new one if the problem persists.
Hi, I used the commands in the README to run this project.
Since you don't specify a model repo, I downloaded the model from here: https://huggingface.co/daryl149/llama-2-7b-chat-hf
The build process was OK, but when I run the built model, it complains: