hao-ai-lab / LookaheadDecoding

The inference results are inconsistent with Huggingface. #30

Open cyfwry opened 7 months ago

cyfwry commented 7 months ago

Hello! Thanks for this new parallel decoding algorithm. When I used minimal.py to compare LookaheadDecoding with Huggingface, I found that the output of some test cases was not consistent with Huggingface. Below are my environment and my test code, which is modified from minimal.py.


GPU: A100-80G
cuda: 11.8
driver: 470.103.01
python3: 3.9.16
pytorch: 1.13.0
transformers: 4.34.0


from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time 
import os 
if int(os.environ.get("LOAD_LADE", 0)):
    import lade 
    lade.config_lade(LEVEL=7, WINDOW_SIZE=20, GUESS_SET_SIZE=20, DEBUG=1)

assert torch.cuda.is_available()

torch_device = "cuda"
model_name = "TinyLlama-1.1B-Chat-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device)
model.tokenizer = tokenizer

data = "How to write a shell script to get a program to restart itself on crash"
model_inputs = tokenizer(data, return_tensors='pt').to(torch_device)
greedy_output = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

You can run it the same way as minimal.py. In my test, the output of Huggingface is:

How to write a shell script to get a program to restart itself on crash?
How do I write a shell script to restart a program when it crashes?
I have a program that I want to automatically restart when it crashes. I want it to be a simple script that just starts the program and then waits for it to finish and then starts the program again.
I've tried using kill -9 <pid> but it doesn't work. Any ideas?
Here's a simple script that should work:

# Start the program

# Wait for it to finish
while true; do
  # Check if the program is running
  pid=$(ps ax | grep "$PROGRAM_NAME" | awk '{print $2}')
  if [ -z "$pid" ]; then
    # Program is not running, start it

  # Check if the program has finished
  sleep 10

This script uses the ps ax command to list the processes running the program and the sleep 10 to wait for 10 seconds before starting it again.
I hope this helps! Let me know if you have any questions.

A: You can use the

And the output of LookaheadDecoding is:

How to write a shell script to get a program to restart itself on crash?
How do I write a shell script to restart a program when it crashes?
I have a program that I want to automatically restart when it crashes. I want it to be a simple script that just starts the program and then waits for it to finish and then starts the program again.
I've tried using kill -9 <pid> but it doesn't work. Any ideas?
Here's a simple script that should work:

# Start the program

# Wait for it to finish
while true; do
  # Check if the program is running
  pid=$(ps -p $USER -o pid= --no-headers | awk '{print $1}')
  if [ -z "$pid" ]; then
    echo "Program not running, waiting..."
    sleep 10
    # Start the program again

This script uses ps to check if the program is running and, if it is, it waits for it to finish. If it's not running, it starts the program again.
I hope this helps! Let me know if you have any questions.
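To pin down where two generations part ways, a small helper like the one below can be used. This is only a sketch: `first_divergence` is a hypothetical name, and the two strings are shortened copies of the `pid=` lines from the outputs above. For a stricter comparison, the same function could be applied to token id lists (for example `greedy_output[0].tolist()` from each run) instead of decoded text.

```python
def first_divergence(a, b):
    """Return the first index where sequences a and b differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they differ where the shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Shortened copies of the first differing line in the two outputs above.
hf_line = 'pid=$(ps ax | grep "$PROGRAM_NAME"'
lade_line = 'pid=$(ps -p $USER -o pid='

idx = first_divergence(hf_line, lade_line)
print(idx, repr(hf_line[:idx]))  # shared prefix ends right after 'pid=$(ps '
```

Everything before that index is identical in both runs, so the two outputs agree for a long prefix and only then separate.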

LMX-xin commented 7 months ago

I also found that the inference results are inconsistent. The input is "who are you " (note the space after "you"), with max_new_tokens=20. The results with N, W, G = 3, 2, 2 and the results with N, W, G = 4, 4, 4 are inconsistent.

Viol2000 commented 7 months ago

Yes, sometimes the results can be inconsistent. We attribute this to floating-point error, and it is normal -- hf w/ flash-attn and hf w/o flash-attn can also produce different outputs sometimes. If you use float32, lade's results should be exactly the same as hf's.
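A minimal, standard-library sketch of the kind of floating-point effect described here: adding the same three numbers in a different order gives a different half-precision result, while the single-precision result agrees in this particular case. The helper names (`to_fp16`, `to_fp32`, `add_left_to_right`, `add_tail_first`) are made up for illustration, and the rounding is emulated with `struct` rather than taken from any GPU kernel, so this is an analogy and not what lade or hf actually execute.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_left_to_right(values, rnd):
    """Accumulate from the first element onward, rounding after every addition."""
    total = rnd(values[0])
    for v in values[1:]:
        total = rnd(total + rnd(v))
    return total


def add_tail_first(values, rnd):
    """Sum everything after the first element, then add the first element last."""
    tail = rnd(values[1])
    for v in values[2:]:
        tail = rnd(tail + rnd(v))
    return rnd(rnd(values[0]) + tail)


values = [2048.0, 1.0, 1.0]

# fp16 has a spacing of 2 between representable values in [2048, 4096),
# so 2048 + 1 rounds back to 2048, while 2048 + 2 is exact.
fp16_a = add_left_to_right(values, to_fp16)
fp16_b = add_tail_first(values, to_fp16)

# fp32 represents 2049 exactly, so both orders agree here.
fp32_a = add_left_to_right(values, to_fp32)
fp32_b = add_tail_first(values, to_fp32)

print(fp16_a, fp16_b)  # 2048.0 2050.0
print(fp32_a, fp32_b)  # 2050.0 2050.0
```

The inputs are chosen to make the effect obvious; in a real model the differences are far smaller, but they come from the same source: the order of operations changes which intermediate values get rounded.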

cyfwry commented 7 months ago

Thanks for your reply. I figured this out. In the case of parallel decoding, due to the loss of precision, there is indeed no guarantee that the result will be exactly the same as that of single-step decoding.

Viol2000 commented 7 months ago

This is floating-point error. Although the logical flow is the same, the computations that run on the GPU are different (i.e., lade computes several tokens per step while hf computes only one token per step). Different floating-point tensor computations can produce very similar outputs, but there will always be small differences. Even if a difference is slight, it can accumulate and eventually change which token is selected, and from that point on the outputs are inconsistent. This explains why the inconsistency tends to appear when the output is relatively long.
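A toy illustration of how a tiny numerical difference turns into a visibly different continuation under greedy decoding. Everything here is hypothetical and stands in for a real model: `toy_logits`, `greedy_decode`, the 3-token vocabulary, and the `noise` value are assumptions chosen so that exactly one step has a near-tie between two tokens.

```python
def toy_logits(tokens, noise=0.0):
    """Hypothetical stand-in for a model: returns logits over a 3-token vocabulary.

    The token after the last one (mod 3) is always favored. At one specific step
    a second token is almost tied with it, and `noise` models a small numerical error.
    """
    last = tokens[-1]
    logits = [0.0, 0.0, 0.0]
    logits[(last + 1) % 3] = 1.0
    if len(tokens) == 4:
        # Near-tie: the runner-up is only 1e-4 below the favorite.
        logits[(last + 2) % 3] = 1.0 - 1e-4 + noise
    return logits


def greedy_decode(prompt, steps, noise=0.0):
    """Greedy decoding: append the argmax token at every step."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_logits(tokens, noise)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


exact = greedy_decode([0], steps=8)
perturbed = greedy_decode([0], steps=8, noise=2e-4)

print(exact)      # [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(perturbed)  # [0, 1, 2, 0, 2, 0, 1, 2, 0]
```

The two runs agree until the near-tie step. A perturbation of 2e-4 is enough to flip the argmax there, and because every later token is conditioned on the earlier ones, the sequences never line up again. Longer generations pass through more such near-ties, which fits the observation that divergence shows up in relatively long outputs.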

Viol2000 commented 7 months ago

Also, I do not think hf fp16's output is the 'correct' one. The hf fp32/lade fp32 outputs should be treated as the 'correct' ones. Sometimes lade fp16's output aligns with the fp32 output while hf fp16's output is inconsistent with its fp32 output.

cyfwry commented 7 months ago

I agree that neither Lade nor HF can always keep its FP16 output consistent with its FP32 output. Initially, I thought that Lade and HF would always produce consistent output under the same precision.