huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Cannot host Llama-3-8B exported by optimum-neuron with TGI container using optimum-neuron (0.0.24) and neuron-sdk (2.19.1) #684

Open cszhz opened 2 weeks ago

cszhz commented 2 weeks ago

System Info

AWS EC2 instance: trn1.32xlarge
OS: Ubuntu 22.04.4 LTS

Platform:

- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.24
- `neuron-sdk` version: 2.19.1
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.24.5
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.2335
- `neuronx-cc` version: 2.14.227.0+2d4f85be
- `neuronx-distributed` version: 0.8.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Who can help?

Inference: @dacorvo, @JingyaHuang; TGI: @dacorvo

Reproduction (minimal, reproducible, runnable)

I confirm that optimum-neuron 0.0.21 with Neuron SDK 2.18.2 works fine.

  1. Export the Neuron model:
    optimum-cli export neuron --model NousResearch/Meta-Llama-3-8B-Instruct --batch_size 1 --sequence_length 1024 --num_cores 2 --auto_cast_type fp16 ./models/NousResearch/Meta-Llama-3-8B-Instruct
  2. Build the Docker image:
    git clone -b v0.0.24-release https://github.com/huggingface/optimum-neuron
    cd optimum-neuron/
    make neuronx-tgi
  3. Start the container:
    docker run -it --name mytest --rm \
       -p 8080:80 \
       -v /home/ubuntu/work/models/:/models \
       -e HF_MODEL_ID=/models/NousResearch/Meta-Llama-3-8B-Instruct \
       -e MAX_INPUT_TOKENS=256 \
       -e MAX_TOTAL_TOKENS=1024 \
       -e MAX_BATCH_SIZE=1 \
       -e LOG_LEVEL="info,text_generation_router=debug,text_generation_launcher=debug" \
       --device=/dev/neuron0 \
       ${neuron_image_name} \
       --model-id /models/NousResearch/Meta-Llama-3-8B-Instruct \
       --max-batch-size 1 \
       --max-input-tokens 256 \
       --max-total-tokens 1024 

    After about 1 minute, the server hangs (see the probe sketch after the logs):

    2024-08-25T07:02:45.593647Z  WARN text_generation_router: router/src/main.rs:372: Invalid hostname, defaulting to 0.0.0.0
    2024-08-25T07:02:45.597333Z  INFO text_generation_router::server: router/src/server.rs:1613: Warming up model
    2024-08-25T07:02:45.597833Z DEBUG text_generation_launcher: Prefilling 1 new request(s) with 1 empty slot(s)
    2024-08-25T07:02:45.598003Z DEBUG text_generation_launcher: Request 0 assigned to slot 0
    2024-08-25T07:02:45.671381Z DEBUG text_generation_launcher: Model ready for decoding
    2024-08-25T07:02:45.671501Z  INFO text_generation_launcher: Removing slot 0 with request 0
    2024-08-25T07:02:45.671737Z  INFO text_generation_router::server: router/src/server.rs:1640: Using scheduler V2
    2024-08-25T07:02:45.671750Z  INFO text_generation_router::server: router/src/server.rs:1646: Setting max batch total tokens to 1024
    2024-08-25T07:02:45.740855Z  INFO text_generation_router::server: router/src/server.rs:1884: Connected
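
A quick way to distinguish a hang from a crashed server at this point is to probe the standard TGI routes from the host. This is not from the original report; a minimal Python sketch, assuming the 8080:80 port mapping from the docker run above:

    # Probe the standard TGI routes to see whether the server still answers
    # after "Connected" appears in the logs (assumes the 8080:80 port mapping
    # from the docker run above).
    import requests

    base = "http://127.0.0.1:8080"
    try:
        r = requests.get(f"{base}/health", timeout=5)
        print("health:", r.status_code)
    except requests.RequestException as e:
        print("health probe failed:", e)

    try:
        r = requests.post(
            f"{base}/generate",
            json={"inputs": "What is Deep Learning?",
                  "parameters": {"max_new_tokens": 20}},
            timeout=60,
        )
        print("generate:", r.status_code, r.text[:200])
    except requests.RequestException as e:
        print("generate probe failed:", e)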

Expected behavior

The TGI server starts and serves generation requests normally.

dacorvo commented 2 weeks ago

@cszhz thank you for your feedback. According to your traces, the server started normally. What do you mean when you say it hangs? What do you get when you query its URL with curl or the huggingface_hub inference client?
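
For reference, a query with the huggingface_hub inference client would look roughly like this (a sketch, assuming the 8080:80 port mapping from the reproduction above):

    # Point the client at the local TGI endpoint instead of the Hub.
    from huggingface_hub import InferenceClient

    client = InferenceClient("http://127.0.0.1:8080")
    print(client.text_generation("What is Deep Learning?", max_new_tokens=20))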

cszhz commented 1 week ago

Hi @dacorvo, I don't think the server started normally; the previous 0.0.21 image worked fine. Here is the response from the Docker container 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.24-neuronx-py310-ubuntu22.04:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
curl: (56) Recv failure: Connection reset by peer
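
One way to narrow this down further (not tried in the thread) is to load the exported artifacts directly with optimum-neuron, outside the TGI container, to check whether the export itself is usable. A rough sketch; the local path matches the export step above:

    from transformers import AutoTokenizer
    from optimum.neuron import NeuronModelForCausalLM

    path = "./models/NousResearch/Meta-Llama-3-8B-Instruct"  # exported earlier
    tokenizer = AutoTokenizer.from_pretrained(path)
    # Loads the precompiled Neuron artifacts from the export directory.
    model = NeuronModelForCausalLM.from_pretrained(path)

    inputs = tokenizer("What is Deep Learning?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))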