huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Sampled/inconsistent output despite do_sample set to False #448

Closed tahirazim closed 4 months ago

tahirazim commented 8 months ago

Despite do_sample being set to False, we occasionally (1-2% of the time) see TGI serving Llama-7B models on Inf2 via SageMaker return different outputs for identical inputs. It seems sampling is happening even when TGI is asked not to sample.

The following simple piece of code should reproduce the problem within 300-400 iterations:

import requests

# TGI_LLAMA_7B_URL (defined elsewhere) is the URL of the TGI endpoint serving the Llama-7B model.
request_parameters = {
    "best_of": None,
    "max_new_tokens": 64,
    "return_full_text": False,
    "do_sample": False,
}

prompt = "Write code to implement the merge sort algorithm."
response = requests.post(TGI_LLAMA_7B_URL, json={'inputs': prompt, 'parameters': request_parameters})
response_text = response.json()[0]["generated_text"]

i=0
while True:
    response = requests.post(TGI_LLAMA_7B_URL, json={'inputs': prompt, 'parameters': request_parameters})
    new_response_text = response.json()[0]["generated_text"]

    if response_text != new_response_text:
        print("Problem", i)
        print(response_text)
        print(new_response_text)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

I'm running a TGI Docker image built from source from this repository at the following commit: https://github.com/huggingface/optimum-neuron/commit/3b3afa4dad

dacorvo commented 8 months ago

Can you try to reproduce this on a model without using TGI? Just call generate repeatedly, like you are doing here. This will make it easier to sort things out.
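
(For reference, a TGI-free check along those lines could look roughly like the sketch below, calling generate directly on a NeuronModelForCausalLM. The checkpoint and compile settings are placeholders borrowed from the snippets later in this thread; this is purely illustrative.)

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Placeholder checkpoint and export settings; adjust num_cores to the instance in use.
model_id = "codellama/CodeLlama-7b-hf"
model = NeuronModelForCausalLM.from_pretrained(
    model_id, export=True, num_cores=2, auto_cast_type="fp16",
    batch_size=1, sequence_length=2048,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Write code to implement the merge sort algorithm.", return_tensors="pt")

# Greedy decoding should be deterministic: repeated calls must produce identical text.
reference = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=64, do_sample=False)[0],
    skip_special_tokens=True,
)
for i in range(1000):
    candidate = tokenizer.decode(
        model.generate(**inputs, max_new_tokens=64, do_sample=False)[0],
        skip_special_tokens=True,
    )
    if candidate != reference:
        print("Mismatch at iteration", i)
        break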

jimburtoft commented 8 months ago

I'm using this code to replicate without TGI, with CodeLlama. Still running.

from optimum.neuron import NeuronModelForCausalLM

#num_cores should be changed based on the instance.  inf2.24xlarge has 6 neuron processors (two cores each), so 12 total
#larger models will need more cores.  You can make your model smaller by changing fp16 to f8.  Some models may require num_cores to be a power of 2
compiler_args = {"num_cores": 2, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

#Put in the model name from Hugging Face.  The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes) 

from optimum.neuron import pipeline
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
# Sanity-check call with sampling enabled (the deterministic comparison below uses do_sample=False)
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt,max_new_tokens=64, do_sample=False, best_of=None)

i=0
while True:
    response = p(prompt,max_new_tokens=64, do_sample=False, best_of=None)

    if gold_response != response:
        print("Problem", i)
        print(gold_response)
        print(response)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

gante commented 8 months ago

Hi there 👋

Batching, which TGI does under the hood, may change the output of the models, regardless of do_sample=False (which runs a deterministic algorithm). This property is also present in transformers and in APIs like OpenAI's. I've written up the technical details behind it in this comment 🤗

It explains why @tahirazim's script fails (dynamic batching with TGI) and @jimburtoft's doesn't (static batch size with transformers)
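
(To illustrate the effect described above with something small and self-contained, the sketch below uses plain transformers on CPU, with gpt2 as an arbitrary stand-in model, and compares greedy generation of a lone prompt against the same prompt left-padded inside a batch of two. The specific model and prompts are assumptions; the point is only that padded batching changes the arithmetic.)

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write code to implement the merge sort algorithm."

# Greedy generation of the prompt alone (batch of one, no padding).
alone = tokenizer([prompt], return_tensors="pt")
out_alone = model.generate(**alone, max_new_tokens=32, do_sample=False,
                           pad_token_id=tokenizer.eos_token_id)
text_alone = tokenizer.decode(out_alone[0], skip_special_tokens=True)

# Same prompt, batched with a longer request, so it gets left-padded.
batched = tokenizer(
    [prompt, prompt + " Use Python and add type hints."],
    return_tensors="pt", padding=True,
)
out_batched = model.generate(**batched, max_new_tokens=32, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
text_batched = tokenizer.decode(out_batched[0], skip_special_tokens=True)

# The two texts are usually identical, but padding slightly changes the arithmetic
# and can occasionally flip a near-tie between the top tokens.
print(text_alone == text_batched)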

dacorvo commented 8 months ago

I agree that dynamic batching will introduce padding, which in turn might lead to subtle differences. However, @tahirazim's script seems synchronous, which means the TGI server would process one request at a time: hence, no batching and no padding. @tahirazim can you confirm this (and also that you're the only one accessing the TGI server during your test)? @jimburtoft did you reproduce the issue with static batching?

tahirazim commented 8 months ago

I tested calling a TGI-hosted Llama-7B model on Inf2 with batch_size=1 and MAX_CONCURRENT_REQUESTS=1, and it's always returning identical outputs when given identical inputs and do_sample set to False.

I've also deployed the model to SageMaker, where multiple clients invoke it concurrently, with randomized exponential backoff whenever a 429 is returned (MAX_CONCURRENT_REQUESTS is still set to 1). TGI now behaves exactly as expected.

So it does seem like the problem is with TGI's continuous/dynamic batching.
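
(A minimal sketch of the kind of client-side retry loop described above, with randomized exponential backoff on HTTP 429; the endpoint URL and retry budget are placeholders.)

import random
import time

import requests

def query_with_backoff(url, payload, max_retries=8):
    """POST to a TGI endpoint, retrying with randomized exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Too many concurrent requests: sleep 2**attempt seconds plus jitter, then retry.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError("endpoint still throttling after %d retries" % max_retries)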

gante commented 8 months ago

> So it does seem like the problem is with TGI's continuous/dynamic batching.

I'd rephrase "TGI's continuous/dynamic batching" to "continuous/dynamic batching" 😄

jimburtoft commented 8 months ago

@dacorvo I tried the code I sent and a few minor variations. I let it run 600-2000 times over multiple runs and never saw a difference. Is there an easy way to test dynamic batching with the pipeline alone to confirm, or do we need multiple simultaneous requests? I think @tahirazim confirmed it with his test, but I'm happy to run anything that might be helpful.

dacorvo commented 8 months ago

You can easily test the effect of dynamic batching by encoding, in the same batch, the prompt and the prompt plus a truncated gold_response, to simulate a generation in progress (I wonder how much generated input it takes before we start seeing differences).
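
(A rough sketch of that experiment, under the assumption that the model is re-exported with batch_size=2 so the prompt and a simulated in-progress request can be padded into one batch; the padding side and generate details are assumptions that may need adjusting for the Neuron runtime.)

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # usual choice for decoder-only generation; an assumption here

# Re-export with batch_size=2 so two requests of different lengths share a padded batch.
model = NeuronModelForCausalLM.from_pretrained(
    model_id, export=True, num_cores=12, auto_cast_type="fp16",
    batch_size=2, sequence_length=2048,
)

prompt = "Write code to implement the merge sort algorithm."

# Reference batch: both rows identical, so no padding is introduced.
ref = tokenizer([prompt, prompt], return_tensors="pt", padding=True)
gold_text = tokenizer.decode(
    model.generate(**ref, max_new_tokens=64, do_sample=False)[0], skip_special_tokens=True
)
continuation = gold_text[len(prompt):]

# Mixed batch: the prompt alongside a simulated generation in progress
# (the prompt plus the first two words of the gold continuation), forcing padding on row 0.
in_progress = prompt + " " + " ".join(continuation.split()[:2])
mixed = tokenizer([prompt, in_progress], return_tensors="pt", padding=True)
padded_text = tokenizer.decode(
    model.generate(**mixed, max_new_tokens=64, do_sample=False)[0], skip_special_tokens=True
)

# With greedy decoding the prefixes should match; padding can flip a near-tie.
print(padded_text[: len(gold_text)] == gold_text)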

jimburtoft commented 8 months ago

@dacorvo Wonder no longer: two words in, but it comes and goes.

I'm having some problems comparing the results automatically, but thankfully I noticed a difference manually, and it is consistently in the same place. As the prompt gets longer, the difference eventually goes away.

from optimum.neuron import NeuronModelForCausalLM

#num_cores should be changed based on the instance.  inf2.24xlarge has 6 neuron processors (two cores each), so 12 total
#larger models will need more cores.  You can make your model smaller by changing fp16 to f8.  Some models may require num_cores to be a power of 2
compiler_args = {"num_cores": 12, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

#Put in the model name from Hugging Face.  The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes) 

from optimum.neuron import pipeline
from transformers import AutoTokenizer
import json
tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
p("My favorite place on earth is", max_new_tokens=10, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt,max_new_tokens=65, do_sample=False, best_of=None, return_full_text=True)
print(gold_response)
gold_response = gold_response[0]["generated_text"]

to_concatinate = p(prompt,max_new_tokens=65, do_sample=False, best_of=None, return_full_text=False)
#print(gold_response)
to_concatinate = to_concatinate[0]["generated_text"]

print (gold_response)

#gold_response = {gold_response['generated_text']}
#print("Initial Gold response: ", gold_response)

i=0
problem=0
# Note: this prefix comparison is approximate; the mismatch shown below was also confirmed manually.
while problem==0:
    concatenated_prompt=prompt
    for word in to_concatinate.split(" "):
        concatenated_prompt = concatenated_prompt + " " + word
        response = p(concatenated_prompt,max_new_tokens=75, do_sample=False, best_of=None, return_full_text=True)
        print("Word:", word)
        new_response_text = response[0]["generated_text"]
        if gold_response != new_response_text[:len(gold_response)+1]:
            print("Problem", i)
            print("Original response:")
            print(gold_response)
            print("New response:")
            print(new_response_text[:len(gold_response)+1])
            problem=1
        i = i + 1

Results from an inf2.24xlarge:

Word: Approach
Problem 1
Original response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![merge sort](../../assets/merge-sort.jpg)

### Code

```javascript
function merge
New response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![Whiteboard](./assets/merge-sort.jpg)

### Code

```javascript
const mergeSort =

dacorvo commented 8 months ago

@jimburtoft thank you for checking that out: this confirms that the problem arises even with a batch size of two and a padding of one. The only solution, if you want something deterministic for the whole generation, would be to forget about continuous batching, use multiple TGI servers that each accept only one request at a time, and do the load-balancing at the SageMaker level (see https://aws.amazon.com/de/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/). cc @philschmid

dacorvo commented 8 months ago

Another option would be to take a completely different approach and cache the results in a front-end, but that is not really a TGI issue then.
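
(For illustration, a minimal front-end cache keyed on the prompt and generation parameters could look like the sketch below; the request shape mirrors the first snippet in this thread, and a real deployment would likely use a shared store such as Redis.)

import json

import requests

# Simple in-memory cache; a real front-end would use a shared, persistent store.
_cache = {}

def cached_generate(url, prompt, parameters):
    """Return the completion for (prompt, parameters), querying TGI only on a cache miss."""
    key = (prompt, json.dumps(parameters, sort_keys=True))
    if key not in _cache:
        response = requests.post(url, json={"inputs": prompt, "parameters": parameters})
        response.raise_for_status()
        _cache[key] = response.json()[0]["generated_text"]
    return _cache[key]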

HuggingFaceDocBuilderDev commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

HuggingFaceDocBuilderDev commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!