meta-llama / llama-stack

Model components of the Llama Stack APIs

Llama3.2-1B only replies "<|end_of_text|>" #245

Open tw40210 opened 3 days ago

tw40210 commented 3 days ago

Hi experts, I just tried to install llama-stack and run the test with Llama3.2-1B, but the response is really weird. Since my GPU has only 6GB of RAM, I can't try a bigger model to see whether the problem is specific to Llama3.2-1B. I just want to make sure I didn't miss anything in the getting-started document. Could you kindly point out anything I might have gotten wrong that leads to this result? Thank you very much!

My install:

git clone git@github.com:meta-llama/llama-stack.git

conda create -n stack python=3.10
conda activate stack
llama stack build
> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local
> Enter the image type you want your distribution to be built with (docker or conda): conda

 Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
> Enter the API provider for the inference API: (default=meta-reference): meta-reference
> Enter the API provider for the safety API: (default=meta-reference): meta-reference
> Enter the API provider for the agents API: (default=meta-reference): meta-reference
> Enter the API provider for the memory API: (default=meta-reference): meta-reference
> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
llama stack configure my_local
Could not find my_local. Trying conda build name instead...
Configuration already exists at `/home/ivan/.llama/builds/conda/my_local-run.yaml`. Will overwrite...
Configuring API `inference`...
=== Configuring provider `meta-reference` for API inference...
Enter value for model (default: Llama3.1-8B-Instruct) (required): Llama3.2-1B            
Do you want to configure quantization? (y/n): n
Enter value for torch_seed (optional): 
Enter value for max_seq_len (default: 4096) (required): 
Enter value for max_batch_size (default: 1) (required): 

Configuring API `safety`...
=== Configuring provider `meta-reference` for API safety...
Do you want to configure llama_guard_shield? (y/n): n
Enter value for enable_prompt_guard (default: False) (optional): 

Configuring API `agents`...
=== Configuring provider `meta-reference` for API agents...
Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite): 

Configuring SqliteKVStoreConfig:
Enter value for namespace (optional): 
Enter value for db_path (existing: /home/ivan/.llama/runtime/kvstore.db) (required): 

Configuring API `memory`...
=== Configuring provider `meta-reference` for API memory...
> Please enter the supported memory bank type your provider has for memory: vector

Configuring API `telemetry`...
=== Configuring provider `meta-reference` for API telemetry...
llama stack run my_local --disable-ipv6

Test

python -m llama_stack.apis.inference.client localhost 5000  --model=Llama3.2-1B

User>hello world, write me a 2 sentence poem about the moon
Assistant> <|end_of_text|>

My OS and GPU

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off |   00000000:01:00.0  On |                  N/A |
| N/A   80C    P0             28W /   80W |    3407MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1847      G   /usr/bin/gnome-shell                            1MiB |
|    0   N/A  N/A      9258      C   ...envs/llamastack-my_local/bin/python       3352MiB |
+-----------------------------------------------------------------------------------------+
AnthonyUphof-zacailab commented 2 days ago

Experiencing the same issue using llama stack build (docker distribution) with the model Llama3.2-1B. However, ollama run Llama3.2-1B works fine ✅

The following is a snippet of the Docker logs, showing the model loaded:

docker run --gpus=all -it -p 5500:5500 -v /home/<USER>/.llama/builds/docker/my-llama32-1b-docker-run.yaml:/app/config.yaml -v /home/<USER>/.llama:/root/.llama llamastack-my-llama32-1b-docker python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5500
Resolved 12 providers
 inner-inference => meta-reference
 models => __routing_table__
 inference => __autorouted__
 inner-safety => meta-reference
 shields => __routing_table__
 safety => __autorouted__
 inner-memory => meta-reference
 memory_banks => __routing_table__
 memory => __autorouted__
 agents => meta-reference
 telemetry => meta-reference
 inspect => __builtin__

Loading model `Llama3.2-1B`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/usr/local/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
Loaded in 4.98 seconds
Finished model load b'{"payload":{"type":"ready_response"}}'

Nvidia-smi output

Mon Oct 14 19:31:18 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro T1000 with Max-Q ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   56C    P0             16W /   40W |    3825MiB /   4096MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        22      C   /python3.10                                 N/A      |
+-----------------------------------------------------------------------------------------+
yanxi0830 commented 14 hours ago

Could you try the Instruct (Chat) model Llama3.2-1B-Instruct?
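For the setup in the original post, that would mean re-running the configure step and entering the Instruct model, then restarting the server and pointing the test client at it. A rough sketch based on the commands above (assuming the Llama3.2-1B-Instruct weights are already downloaded):

llama stack configure my_local        # enter Llama3.2-1B-Instruct at the model prompt
llama stack run my_local --disable-ipv6
python -m llama_stack.apis.inference.client localhost 5000 --model=Llama3.2-1B-Instruct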

AnthonyUphof-zacailab commented 11 hours ago

@yanxi0830 I can confirm success with the model Llama3.2-1B-Instruct, using the inference example https://github.com/meta-llama/llama-stack-client-python/blob/main/examples/inference/client.py (modified for the model).

(screenshot of the successful inference output)
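For reference, a minimal sketch of that modified client call, assuming the llama-stack-client-python API as of this thread (parameter names such as model may differ in newer releases) and the stack server from above listening on localhost:5000:

from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

# Point the client at the running stack server.
client = LlamaStackClient(base_url="http://localhost:5000")

# Use the Instruct (chat) model rather than the base model.
response = client.inference.chat_completion(
    model="Llama3.2-1B-Instruct",
    messages=[
        UserMessage(role="user", content="hello world, write me a 2 sentence poem about the moon"),
    ],
)
print(response.completion_message.content)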