huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference

`stop` param doesn't work at all for `/v1/completions` endpoint #1999

Closed: josephrocca closed this issue 1 month ago

josephrocca commented 4 months ago

System Info

Official docker image, v2.0.4

Reproduction

Try using the `stop` param for the `/v1/completions` endpoint. Note that this bug report is not related to the separate issue about token-boundary limitations of stop sequences; this report is about the `stop` param not working at all for the `/v1/completions` endpoint.

Expected behavior

It should work; currently it has no effect at all. Ideally the behavior would match OpenAI's (i.e. no token-boundary limitations, per the separate issue mentioned above), but if that's not possible, then it should at least behave like `/generate`.
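
For reference, the two requests being compared look roughly like this. This is a minimal sketch assuming a TGI server on localhost:8080; the "model" value on the OpenAI-compatible route and the `stop` field name inside `parameters` for `/generate` are assumptions about TGI's request shapes, not something verified in this thread.

import requests

BASE = "http://localhost:8080"  # assumed local TGI server

# OpenAI-compatible route: `stop` is a top-level field. This is the
# parameter reported here as having no effect in v2.0.4.
resp = requests.post(f"{BASE}/v1/completions", json={
    "model": "tgi",  # assumed placeholder; TGI serves a single model
    "prompt": "Count: 1, 2, 3, 4,",
    "max_tokens": 20,
    "stop": [","],
})
print(resp.json())

# Native route: stop sequences go inside `parameters` (field name
# assumed to be `stop`). Per this report, stopping works here.
resp = requests.post(f"{BASE}/generate", json={
    "inputs": "Count: 1, 2, 3, 4,",
    "parameters": {"max_new_tokens": 20, "stop": [","]},
})
print(resp.json())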

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

josephrocca commented 3 months ago

Not stale

drbh commented 2 months ago

Hi @josephrocca, thank you for opening this issue. I'm attempting to reproduce on main but am having some trouble.

TGI started with

docker run --shm-size=1gb --gpus all \
  -v /nvme0n1/Models/:/data \
  -e HUGGINGFACE_HUB_CACHE=/data \
  -e MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1 \
  -e PORT=8080 \
  -p 3000:8080 \
  ghcr.io/huggingface/text-generation-inference:2.2.0
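
Before running the client script, a quick readiness check can rule out startup problems; a minimal sketch (not part of the original thread) using TGI's GET /info metadata endpoint, with port 3000 matching the -p 3000:8080 mapping above:

import requests

# /info returns model metadata once the server is ready to serve requests.
info = requests.get("http://localhost:3000/info", timeout=5)
info.raise_for_status()
print(info.json().get("model_id"))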

Completions calls with and without a `stop` value:

from openai import OpenAI
import os

# Point the OpenAI client at the local TGI server (host port 3000 is
# mapped to the container's 8080 above).
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key=os.getenv("HF_TOKEN", "YOUR_API_KEY"),
)

SEED = 1337
PROMPT = "What are three words that describe the Python programming language?"
MODEL = "mistralai/Mistral-7B-Instruct-v0.1"

def without_stop():
    # Baseline: no stop sequence, bounded only by max_tokens.
    chat_completion = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=20,
        stream=False,
        seed=SEED,
    )

    generated = chat_completion.choices[0]
    print(generated)

def with_stop():
    # Same request, but generation should halt at the first "-".
    chat_completion = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=20,
        stream=False,
        seed=SEED,
        stop=["-"],
    )

    generated = chat_completion.choices[0]
    print(generated)

without_stop()
with_stop()

Response (first without stop, then with stop):

Object-oriented, Versatile, Easy to learn.

Object-

Would you kindly try the latest build? If the issue persists, would you be able to share a reproduction script? Thank you!

drbh commented 1 month ago

Closing, as I just retested with meta-llama/Meta-Llama-3.1-8B-Instruct locally and `stop` is working as expected:

from openai import OpenAI
import os

# Same local TGI setup as above, now serving Llama 3.1 8B Instruct.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key=os.getenv("HF_TOKEN", "YOUR_API_KEY"),
)

SEED = 1337
PROMPT = "What are three words that describe the Python programming language?"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def without_stop():
    # Baseline: bounded only by max_tokens.
    chat_completion = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=10,
        stream=False,
        seed=SEED,
    )

    generated = chat_completion.choices[0]
    print("without_stop\t", generated.text)

def with_stop():
    # Generation should halt at the first ",".
    chat_completion = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=10,
        stream=False,
        seed=SEED,
        stop=[","],
    )

    generated = chat_completion.choices[0]
    print("with_stop\t", generated.text)

without_stop()
with_stop()

# OUTPUT
# without_stop      (Pick one, two or three that you feel
# with_stop         (Pick one,
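
As a follow-up check (not in the original comment), the completion response's finish_reason field from the OpenAI schema makes the stop behavior explicit rather than something to eyeball: "stop" means a stop sequence (or EOS) ended generation, "length" means max_tokens ran out. Note from the output above that TGI includes the stop sequence itself in the returned text. A small sketch reusing the client and constants defined above:

def check_finish_reason():
    completion = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=10,
        stream=False,
        seed=SEED,
        stop=[","],
    )
    choice = completion.choices[0]
    # Expected: "stop" when the stop sequence fired, "length" when
    # max_tokens was exhausted.
    print(choice.finish_reason, repr(choice.text))

check_finish_reason()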