Open · tw40210 opened this issue 3 days ago
Experiencing the same issue using `llama stack build` (distribution docker) with model `Llama3.2-1B`. However, `ollama run Llama3.2-1B` works fine ✅

The following is a snippet of the Docker logs, showing the model loaded:
docker run --gpus=all -it -p 5500:5500 -v /home/<USER>/.llama/builds/docker/my-llama32-1b-docker-run.yaml:/app/config.yaml -v /home/<USER>/.llama:/root/.llama llamastack-my-llama32-1b-docker python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5500
Resolved 12 providers
inner-inference => meta-reference
models => __routing_table__
inference => __autorouted__
inner-safety => meta-reference
shields => __routing_table__
safety => __autorouted__
inner-memory => meta-reference
memory_banks => __routing_table__
memory => __autorouted__
agents => meta-reference
telemetry => meta-reference
inspect => __builtin__
Loading model `Llama3.2-1B`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/usr/local/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
_C._set_default_tensor_type(t)
Loaded in 4.98 seconds
Finished model load b'{"payload":{"type":"ready_response"}}'
nvidia-smi output:
Mon Oct 14 19:31:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02 Driver Version: 556.12 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro T1000 with Max-Q ... On | 00000000:01:00.0 On | N/A |
| N/A 56C P0 16W / 40W | 3825MiB / 4096MiB | 11% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 22 C /python3.10 N/A |
+-----------------------------------------------------------------------------------------+
Could you try the Instruct (Chat) model `Llama3.2-1B-Instruct`?
@yanxi0830 I can confirm success with model `Llama3.2-1B-Instruct`, using the inference example https://github.com/meta-llama/llama-stack-client-python/blob/main/examples/inference/client.py (modified for the model).
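For anyone who wants a quicker check than the full example script, here is a minimal sketch of that client call. It assumes the server started by the docker command above is listening on localhost:5500, that the llama-stack-client Python package is installed, and that the model name matches your run config; parameter names can differ between package versions, so treat this as illustrative rather than canonical.

```python
# Minimal sketch adapted from examples/inference/client.py.
# Assumptions: llama-stack server reachable at localhost:5500,
# llama-stack-client installed, and a model name matching the run config.
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

client = LlamaStackClient(base_url="http://localhost:5500")

response = client.inference.chat_completion(
    model="Llama3.2-1B-Instruct",
    messages=[UserMessage(role="user", content="Write a two-sentence poem about the moon.")],
    stream=False,
)
print(response)
```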
Hi experts, I just tried to install llama-stack and run the test with `Llama3.2-1B`, but the response I get is really weird. Since my GPU has only 6GB of RAM, I can't try a bigger model to see whether the problem is specific to `Llama3.2-1B`. I just want to make sure I didn't miss anything in the "Getting Started" document. Could you kindly point out anything I might have gotten wrong that would lead to this result? Thank you very much!
My install: (screenshot)
Test: (screenshot)
My OS and GPU: (screenshot)