huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Running InferenceClient in docker for TGI web interface #1790

Closed danilyef closed 1 year ago

danilyef commented 1 year ago

Describe the bug

I have successfully deployed TGI (https://github.com/huggingface/text-generation-inference) with Llama-2 using the standard command:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<some_token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --sharded true --num-shard 2 

The TGI backend for the model is running on Ubuntu 18.04.

In order to test requests I used the following command:

curl 152.12.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

where 152.12.0.1 is the gateway of the TGI backend's Docker network.
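(For reference, the gateway address can be looked up by inspecting the Docker network the TGI container is attached to, e.g. for the default bridge network:)

docker network inspect bridge | grep Gateway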

You can access the API for testing requests from your local machine: http://152.20.147.36:8080/docs/#/

where 152.20.147.36 is the Ubuntu server's IP address.

Every curl request worked as expected.

What I want to do is build a Docker container for the web interface and run it on Ubuntu 18.04 (the same environment as TGI), so that it is accessible from my local machine (i.e. reaching the web interface at 152.20.147.36:7860). For this purpose I am using InferenceClient from huggingface_hub. Here is my script:

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://152.12.0.1:8080")

SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don’t know the answer to a question, please don’t share false information.
<</SYS>>
"""

def format_message(message: str, history: list, memory_limit: int = 3) -> str:
    """
    Formats the message and history for the Llama model.

    Parameters:
        message (str): Current message to send.
        history (list): Past conversation history.
        memory_limit (int): Limit on how many past interactions to consider.

    Returns:
        str: Formatted message string
    """
    # always keep len(history) <= memory_limit

    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    # Start with the oldest kept exchange: history[0][0] is the user question, history[0][1] the model answer
    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Append the remaining exchanges from the kept history window
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message

def inference(message, history):

    try:
        query = format_message(message, history)

        partial_message = ""
        for token in client.text_generation(
            query,
            max_new_tokens=512,
            temperature=.1, 
            top_k=40, 
            top_p=.9, 
            repetition_penalty=1.18, 
            stream=True):
            partial_message += token
            yield partial_message
    except Exception as e:
        # Print the exception to the console for debugging
        print("Exception encountered:", str(e))
        # Optionally, you can yield a message to the user
        yield f"An Error occured please 'Clear' the error and try your question again"

# Create and modify the theme to use Teal
theme = gr.themes.Default(primary_hue="teal").set(
    loader_color="#008080",  # Teal color for loader
)

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=475),
    textbox=gr.Textbox(placeholder="Please ask me a Question...", container=False, scale=7),
    description="I am a general knowledge Chatbot that uses LLaMA 7B-Chat model from Meta.",
    title="A-Team chat: how can I help you?",
    examples=["What is the meaning of life?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
    theme=theme,  # Apply the theme here
).queue().launch()

where 152.12.0.1 is the gateway of the TGI backend's Docker network. When I run the Docker image there is no error, but unfortunately I cannot access the web interface from my local machine (using 152.20.147.36:7860).

But if I start the script locally without Docker (python3 web.py), everything works well and I can use the web interface at http://127.0.0.1:7860.

Reproduction

My Dockerfile:

# Use an official Python 3.9 image as the base image
FROM python:3.9

# Set the working directory to /app
WORKDIR /app

# Copy the requirements.txt file into the container at /app
COPY requirements.txt /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Copy the web.py file into the container at /app
COPY web.py /app

# Make port 7860 available to the world outside this container
EXPOSE 7860

# Define the command to run your application
CMD ["python", "web.py"]

Docker commands:

docker build -t llama_chat .
docker run --security-opt seccomp:unconfined -p 7860:7860 -d llama_chat

requirements.txt

gradio==3.50.2
huggingface_hub==0.18.0

--security-opt seccomp:unconfined is a necessary workaround for Ubuntu 18.04, because otherwise docker run will give you errors. (https://medium.com/nttlabs/ubuntu-21-10-and-fedora-35-do-not-work-on-docker-20-10-9-1cd439d9921)

Logs

No errors in logs

System info

huggingface_hub: 0.18.0
OS: Ubuntu 18.04
Wauplin commented 1 year ago

Hi @danilyef, thanks for the detailed report. From what I understand, it looks like it is not an InferenceClient issue but something to fix on the Docker configuration side. You need to give the llama_chat container access to the gateway (i.e. 152.12.0.1:8080), otherwise you won't be able to reach it from within the container. That explains why the script works locally (python3 web.py) but not in Docker.

To check that the network access is correct, try running

import requests

response = requests.get("http://152.12.0.1:8080")
response.raise_for_status()

at the very beginning of the web.py script. If it fails, it means the issue doesn't come from the Python script itself but from the Docker config.
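A slightly stricter variant, assuming TGI's /health route is available (it should return 200 once the model is loaded and ready):

import requests

# Sanity check from inside the llama_chat container: is the TGI server
# reachable at all, and is the model ready to serve? Raises if not.
response = requests.get("http://152.12.0.1:8080/health", timeout=5)
response.raise_for_status()
print("TGI is reachable and ready")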

danilyef commented 1 year ago

@Wauplin thank you for your quick response. I added your code to my script, then built and ran my Docker image on Ubuntu 18.04 again.

docker logs didn't show anything.

I also entered the container and executed the web.py script inside the Docker container, and it works (see screenshot) file_report

But it is still unfortunately not accessible from my local machine. The TGI API is accessible.

What I have noticed: when I executed the script inside Docker, the port was 7861, which isn't the default (it should be 7860).

danilyef commented 1 year ago

I decided to test other, non-existing routes (like http://152.12.0.12:8080). The container throws an error and stops, as expected.

Wauplin commented 1 year ago

thank you for your quick response. I added your code to my script, then built and ran my Docker image on Ubuntu 18.04 again. docker logs didn't show anything.

Hmm, OK, then it really does have access to it.

What I have noticed: when I executed the script inside Docker, the port was 7861, which isn't the default (it should be 7860).

Could that be the problem? By default Gradio will launch on 7860 but if it is taken, it will try 7861, 7862, 7863,...

danilyef commented 1 year ago

@Wauplin I don't think so, because port 7860 is already used by the container. The netstat -tuln command shows me that the local address 0.0.0.0:7860 is in use (because the container is running). But when I execute the script inside the container, it seems like it's taking another port.

danilyef commented 1 year ago

When I run the curl command curl http://127.0.0.1:7860 in order to check the connection, I get the following error: Recv failure: Connection reset by peer

Wauplin commented 1 year ago

So on your machine port 7860 is taken by the container because of docker run --security-opt seccomp:unconfined -p 7860:7860 -d llama_chat, which is normal. But if inside the container the Gradio app starts on port 7861, there must be a reason for it. Can you try gr.ChatInterface(...).queue().launch(server_port=7860) to explicitly force it to start on 7860 or raise an error?

danilyef commented 1 year ago

If I execute web.py inside Docker, I get the following error:

Traceback (most recent call last):
  File "/app/web.py", line 82, in <module>
    gr.ChatInterface(
  File "/usr/local/lib/python3.9/site-packages/gradio/blocks.py", line 2033, in launch
    ) = networking.start_server(
  File "/usr/local/lib/python3.9/site-packages/gradio/networking.py", line 207, in start_server
    raise OSError(
OSError: Cannot find empty port in range: 7860-7860. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.

Wauplin commented 1 year ago

Good, so this is the error. Any idea why this port is already in use in the container? Could you try:

  1. removing

    # Make port 7860 available to the world outside this container
    EXPOSE 7860

    from the Dockerfile. From my understanding, this is not needed (plus you are starting your container with -p 7860:7860).

  2. starting the app on a random port (7888?) and starting the container with -p 7888:7888, at least to be sure that nothing else is running on the port you want to use (see the sketch below).
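For 2., a minimal sketch (only the launch() call and the host port mapping change; the other ChatInterface arguments stay as in web.py, and 7888 is just an arbitrary example port):

# web.py – force Gradio onto an explicit port instead of letting it pick one
gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=475),
    theme=theme,
).queue().launch(server_port=7888)

# publish the matching port on the host:
# docker run --security-opt seccomp:unconfined -p 7888:7888 -d llama_chat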

danilyef commented 1 year ago

I fixed the problem by setting server_name to "0.0.0.0": launch(server_name="0.0.0.0", server_port=7860)
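For reference, the end of web.py now looks like this (only the launch() call changed; binding to 0.0.0.0 makes Gradio listen on all interfaces inside the container, so the port published with -p 7860:7860 is reachable from outside):

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=475),
    textbox=gr.Textbox(placeholder="Please ask me a Question...", container=False, scale=7),
    description="I am a general knowledge Chatbot that uses LLaMA 7B-Chat model from Meta.",
    title="A-Team chat: how can I help you?",
    examples=["What is the meaning of life?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
    theme=theme,  # Apply the theme here
).queue().launch(server_name="0.0.0.0", server_port=7860)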

Thank you guys for your quick responses, it guided me in the right direction :)

Wauplin commented 1 year ago

Good to hear! Wishing you a good continuation :hugs: