Closed: AegeanYan closed this issue 1 year ago.
Thanks for the question! In theory, the local CUDA GPU arch is detected using the same API that nvidia-smi relies on (i.e., the CUDA runtime). If that's the case here, I just wanted to check a few details of your environment setup.
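For reference, here is a minimal sketch of that query path, assuming the TVM runtime bundled with mlc-llm is importable as tvm (the attributes used are from TVM's Device API, not mlc-llm itself):

import tvm

# Query CUDA device 0 through the TVM runtime, which calls into the
# CUDA runtime underneath (cudaGetDeviceProperties and friends).
dev = tvm.cuda(0)
if dev.exist:
    print("name:", dev.device_name)                    # e.g. an RTX-class GPU
    print("compute capability:", dev.compute_version)  # e.g. "8.6"
else:
    print("CUDA runtime sees no device -- likely a driver/container issue")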
I'm sorry, I'm just using plain Docker; I'll try again.
I haven't been able to reach my server administrator to install nvidia-docker. But before that, could you please tell me whether your quantization method makes it possible for me to deploy Llama-1 30B on a 24 GB GPU? And is Llama-2 70B possible on 2x 24 GB GPUs?
We are benchmarking on Llama 2, so no guarantees for Llama-1 30B (I suppose it should work), but Llama-2 7B/13B should work out of the box. Distributed inference is a work in progress, but it's on the horizon.
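For a rough sense of why, here is a hedged back-of-envelope: the weight size is exact arithmetic for 4-bit quantization, while the overhead term for activations and KV cache is an assumption, not a measurement.

# Back-of-envelope VRAM estimate: 4-bit weights are 0.5 bytes per parameter;
# the ~3 GiB overhead for activations/KV cache is an illustrative guess.
def approx_vram_gib(n_params_billion, bits=4, overhead_gib=3.0):
    weights_gib = n_params_billion * 1e9 * bits / 8 / 2**30
    return weights_gib + overhead_gib

print(approx_vram_gib(30))  # ~17 GiB: tight but conceivable on a 24 GB card
print(approx_vram_gib(70))  # ~36 GiB: needs more than one 24 GB card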
Thx!
@junrushao Hi Junru, I've checked that.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 31%   42C    P8    30W / 350W |    664MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:06:00.0 Off |                  N/A |
| 31%   42C    P8    23W / 350W |    664MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:45:00.0 Off |                  N/A |
| 75%   70C    P2   344W / 350W |  22470MiB / 24576MiB |     62%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:46:00.0 Off |                  N/A |
| 73%   69C    P2   345W / 350W |  23518MiB / 24576MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:85:00.0 Off |                  N/A |
| 30%   31C    P8    21W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:86:00.0 Off |                  N/A |
| 30%   28C    P8    23W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:C5:00.0 Off |                  N/A |
| 99%   87C    P2   249W / 350W |  22454MiB / 24576MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:C6:00.0 Off |                  N/A |
| 63%   63C    P2   311W / 350W |  22604MiB / 24576MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@llm-perf:/mlc_llm# ldd build/mlc_chat_cli
        linux-vdso.so.1 (0x00007ffc212ce000)
        libmlc_llm.so => /mlc_llm/build/libmlc_llm.so (0x00007f1bfb25c000)
        libtvm_runtime.so => /mlc_llm/build/tvm/libtvm_runtime.so (0x00007f1bfb04c000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1bfae1c000)
        libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1bfadfc000)
        libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f1bfabd4000)
        libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f1bfaaeb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1bfbad9000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f1bfa800000)
        libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f1bf8b17000)
        libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1bfaae6000)
        libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1bfaae1000)
        librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f1bfaada000)
Could you help me check what's going wrong?
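One quick check that may help localize the problem: ask the libcudart.so.12 that mlc_chat_cli links against (per the ldd output above) whether it can see any device at all. A minimal probe, assuming that library is on the loader path:

import ctypes

# cudaGetDeviceCount is a standard CUDA runtime entry point; a return
# value of 0 (cudaSuccess) with a nonzero count means the runtime sees GPUs.
cudart = ctypes.CDLL("libcudart.so.12")
count = ctypes.c_int(0)
err = cudart.cudaGetDeviceCount(ctypes.byref(count))
print("cudaGetDeviceCount ->", err, "| devices:", count.value)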
Hi, I'm new to this project. I've built the Docker image and tried to run the
python build.py
command, and received an error. I'm not sure where it goes wrong, since I can see my GPUs using nvidia-smi. I'm excited to try your work on accelerating Llama inference speed.
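Since four of the eight GPUs in the nvidia-smi listing above are nearly full, one thing worth trying is restricting the build to the idle ones. A minimal sketch, assuming standard CUDA_VISIBLE_DEVICES behavior (any build.py flags are deliberately omitted, since they vary across mlc-llm versions):

import os
import subprocess

# Expose only the idle GPUs (indices follow the nvidia-smi listing above);
# CUDA_VISIBLE_DEVICES is standard CUDA runtime behavior, not mlc-llm specific.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
subprocess.run(["python", "build.py"], check=True)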