predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Generating garbage output #521

Open shreyansh26 opened 4 months ago

shreyansh26 commented 4 months ago

System Info

Using Docker server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus '"device=3"' --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/predibase/lorax:main --model-id $model

Running on a node with 8x H100 80GB GPUs. Device 3 is completely idle, with no other processes running on it.
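
As a quick sanity check before sending prompts, the router can be probed from Python. This is a minimal sketch and assumes LoRAX keeps the TGI-style /health and /info routes; the Info struct does show up in the router log further below.

import requests

BASE_URL = "http://127.0.0.1:8080"  # port mapped in the docker run above

# /health should return 200 once the shard is ready (TGI-style route,
# assumed to be retained by LoRAX)
print("health:", requests.get(f"{BASE_URL}/health", timeout=5).status_code)

# /info reports the model id, dtype and batching limits, i.e. the same
# Info struct that appears in the router log below
info = requests.get(f"{BASE_URL}/info", timeout=5).json()
print("model_id:", info.get("model_id"))
print("model_dtype:", info.get("model_dtype"))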


Reproduction

Launch the LoRAX server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus '"device=3"' --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/predibase/lorax:main --model-id $model

Use lorax-client with Python to query the server.

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] What is the capital of Portugal? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

This generates garbage output

Theaot of the9
 the00-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And on the server side -

2024-06-19T15:19:59.434864Z  INFO HTTP request{otel.name=POST / http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/ http.scheme=HTTP http.target=/ http.user_agent=python-requests/2.31.0 otel.kind=server trace_id=a884cc60137b84e19b63b832ca233d42}:compat_generate{default_return_full_text=Extension(false) info=Extension(Info { model_id: "mistralai/Mistral-7B-Instruct-v0.1", model_sha: Some("73068f3702d050a2fd5aa2ca1e612e5036429398"), model_dtype: "torch.float16", model_device_type: "cuda", model_pipeline_tag: Some("text-generation"), max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_total_tokens: 453184, max_waiting_tokens: 20, validation_workers: 2, version: "0.1.0", sha: None, docker_label: None, request_logger_url: None, embedding_model: false }) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x560439960630, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 32 }, bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } }) req_headers={"host": "127.0.0.1:8080", "user-agent": "python-requests/2.31.0", "accept-encoding": "gzip, deflate", "accept": "*/*", "connection": "keep-alive", "content-length": "562", "content-type": "application/json"}}:generate{parameters=GenerateParameters { adapter_id: None, adapter_source: None, adapter_parameters: None, api_token: None, best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(64), ignore_eos_token: false, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, return_k_alternatives: None, apply_chat_template: false, seed: None, response_format: None } total_time="706.83088ms" validation_time="302.036µs" queue_time="48.49µs" inference_time="706.480604ms" time_per_token="11.038759ms" seed="None"}: lorax_router::server: router/src/server.rs:590: Success

And this is not input-related: garbage values are generated with pretty much every prompt I tried.
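
A quick way to see this across inputs, reusing the same lorax-client call as above (the prompt list here is just illustrative):

from lorax import Client

client = Client("http://127.0.0.1:8080")

# A few prompts in the Mistral-Instruct format; all of them come back
# garbled with the :main image
prompts = [
    "[INST] What is the capital of Portugal? [/INST]",
    "[INST] Summarize the plot of Hamlet in one sentence. [/INST]",
    "[INST] Write a haiku about the ocean. [/INST]",
]

for prompt in prompts:
    out = client.generate(prompt, max_new_tokens=64).generated_text
    print(repr(out))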

Expected behavior

Using a simple HF inference script gives the expected output.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

prompt = "[INST] What is the capital of Portugal? [/INST]"

encodeds = tokenizer.encode(prompt, return_tensors="pt")
model_inputs = encodeds.to(device)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Output

<s> [INST] What is the capital of Portugal? [/INST] The capital city of Portugal is Lisbon.</s>
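
For reference, the same prompt string can be produced with the tokenizer's built-in chat template instead of hand-writing the [INST] markers; a small sketch (assumes a transformers version with apply_chat_template):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "What is the capital of Portugal?"}]

# Renders the conversation with Mistral's chat template; the result should
# match the manual "[INST] ... [/INST]" prompt above, with the BOS token prepended
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)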
shreyansh26 commented 4 months ago

Okay, so it looks like using ghcr.io/predibase/lorax:latest fixes it. Probably an issue with the current main branch.

GirinMan commented 4 months ago

I'm facing a similar problem. When using the image ghcr.io/predibase/lorax:main, I see garbage outputs. Older versions of LoRAX and HF TGI do not produce this issue.

curl -X 'POST' \
  'http://127.0.0.1:50710/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }],
    "model": "",
    "seed": 42,
    "max_tokens": 256, "temperature": 0.1
  }'
{
  "id":"null",
  "object":"text_completion",
  "created":0,
  "model":"null",
  "choices":[
    {"index":0,"message": 
      {
        "role":"assistant",
        "content":"I \n ____ \n\n코_김 \n코 a a \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1 and the and the and the and the a and the the a a and the a a a a a a a a a a a a a a a a a a a a the the a and 2 and the a a a the the the a the the the the the a a 2 and I a\n\n\n1\n\n\n\n\n\n\n\n\n\n\n\n\n\ns and /******/ and the and the and the and the and the the the the the the the the a and /******/ and the the the the the a and /******/ and /******/ 1 and the, a a a a and the a.\n\n\n /******/ and /******/ and /******/ and /******/ and /******/ and /******/ and the a a.jpg the the the a the the and a and /******/ and the a a the the the a a the a a a a a to the a a a a. /******/ a"
      },
      "finish_reason":"length"
    }
  ],
  "usage":{
    "prompt_tokens":25,
    "total_tokens":281,
    "completion_tokens":256
  }
}
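
The same request can also be reproduced from Python against the OpenAI-compatible endpoint, e.g. with plain requests (host and port as in the curl above):

import requests

resp = requests.post(
    "http://127.0.0.1:50710/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who are you?"},
        ],
        "model": "",
        "seed": 42,
        "max_tokens": 256,
        "temperature": 0.1,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])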