[BUG] Cannot Reproduce H2O Prediction Output

diogobragaswogo commented 11 months ago

🐛 Bug

I trained a model based on circulus/Llama-2-7b-orca-v1 and exported:

Model
Validation predictions
Logs

However, I'm currently having troubles reproducing the exact output obtained in the validation predictions CSV. Specifically, the model is using greedy search and I've tried loading the model in the following ways:

Using vLLM
Using HF TGI
Using the sample code of the model card.

vLLM and HF TGI give the same output when running with greedy search, but it differs from the one in the prediction file. The sample code of the model gives a different output to the ones of vLLM, HF TGI and the one in the file, when using greedy search (although it comes closer than vLLM and HF TGI).

Considering the above, I'm not sure if I'm loading the model correctly (and passing the intended generation params), or if there is an issue. The prediction params are as follows:

prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 1024
    metric: BLEU
    metric_gpt_model: gpt-3.5-turbo-0301
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.2
    stop_tokens: ''
    temperature: 0.3
    top_k: 0
    top_p: 1.0

To Reproduce

Just train the circulus/Llama-2-7b-orca-v1 (or another model) on custom data and check whether the validation prediction data can be reproduced (i.e. the outputs) using greedy search.

LLM Studio version

Nightly version of LLM Studio (as obtained by running docker run ... gcr.io/vorvan/h2oai/h2o-llmstudio:nightly).

maxjeblick commented 11 months ago

Thank you for the detailed description! Would it be possible for you to share the training configuration YAML file?

Regarding the discrepancies between validation predictions and hosted inference using vLLM and HF TGI, one or more of the following could be a potential explanation:

LORA: If LORA is merged after training, the resulting model may produce slightly different logits.
Quantization/Model Data Type. This may be different during training and inference. Note that we also have additional logic to cast certain params to fp32 for model stability during training.
Implementation Differences between HF and vLLM/TGI: An issue related to this has been documented here.

Additionally, we've observed discrepancies in validation and chat outputs attributed to mismatches in tokenizer configurations, as detailed here.

diogobragaswogo commented 11 months ago

Hey @maxjeblick

Thank you for your swift response. From my investigation, it may be due to LORA and Quantization (I'm still in the process of assessing whether tokenizer configurations would come into play). However, just to be sure, here's the training configuration YAML used:

training:
    batch_size: 1
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_best_checkpoint: false
    schedule: Cosine
    train_validation_data: false
    warmup_epochs: 0.0
    weight_decay: 0.0

If it turns out to be due to LORA/Quantization, is there any recommended literature I could refer to (specific to H2O LLM Studio) for the best way to handle these, as to minimize differences between training and prediction? My goal here is to avoid falling into a "rabbit-hole" of constantly training the model to increase the BLEU score, but then have the corresponding BLEU in prediction randomly vary (no problem if it doesn't increase as much as in training, so as long as it still increases).

maxjeblick commented 11 months ago

Are your predictions differ completely, is one method constantly producing worse results, or do the predictions differ after some words?

as to minimize differences between training and prediction?

Switching to fp16 and inference batch size 1 (see discussion here) should reduce differences. Apart from that, slight differences in logit output between using the model for inference during the training process, using the (merged) model for inference and using a potentially different implementation (vLLM/TGI) are expected. If predictions are completely different or worse for one method, there probably a bug somewhere.

diogobragaswogo commented 11 months ago

They seem to only occur in a subset, although this subset represents around 20% of validation prompts. The methods outside of H2O tend to be producing worse results than the ones found in the validation file, but it's not always.

The prompt and output consist of JSON values (among other things), which tend to make the prediction vary past the initial 100 tokens or so.

Considering your previous explanation, I wouldn't say that they are completely different (or worse) to the point of a bug necessarily existing. However, I'll continue doing some testing with the new information you've given me and will come back with my findings! Thank you for the help so far.

diogobragaswogo commented 10 months ago

@maxjeblick Sorry for the long hiatus - I'm back with some results and they are... Confusing.

I generated the outputs for the validation set of my model against these three different implementations:

Text Generation Interface.
Hugging Face Basic transformers pipeline.
vLLM

All of the above are using generation config parameters to make them as deterministic as possible (i.e. temp=0.1, do_sample=False, etc.), since in LLM Studio, the validation used deterministic generation. Here's what I found:

Text Generation Interface

BLEU score (truth vs. h2o_pred): 80.88295765859095
BLEU score (truth vs. model_pred): 28.665971262286216
BLEU score (h2o_pred vs. model_pred): 29.20357323737969

Hugging Face Basic Transformers Pipeline

BLEU score (truth vs. h2o_pred): 80.88295765859095
BLEU score (truth vs. model_pred): 26.181086090582695
BLEU score (h2o_pred vs. model_pred): 28.04616682539429

vLLM

BLEU score (truth vs. h2o_pred): 80.88295765859095
BLEU score (truth vs. model_pred): 21.74056099054717
BLEU score (h2o_pred vs. model_pred): 20.213948909024584

Although looking at it from a manual qualitative approach, some of the results are not as bad, they are definitely substantially different from H2O's, this is considering that all of them are being deterministic.

Based on this, I'm not sure if there's a bug, but this definitely makes it harder to improve the model's quality in production. For reference, here's the source code used to get the BLEU scores:

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file into a DataFrame
df = pd.read_csv("./vllm_infer_results.csv")

# Initialize the BLEU smoothing function
smooth = SmoothingFunction().method1

# Function to compute BLEU score
def compute_bleu(reference, candidate):
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smooth)

# Compute BLEU scores for "truth" vs. "h2o_pred"
bleu_truth_vs_h2o = df.apply(lambda row: compute_bleu(row['truth'], row['h2o_pred']), axis=1)

# Compute BLEU scores for "truth" vs. "model_pred"
bleu_truth_vs_model = df.apply(lambda row: compute_bleu(row['truth'], row['model_pred']), axis=1)

# Compute BLEU scores for "h2o_pred" vs. "model_pred"
bleu_h2o_vs_model = df.apply(lambda row: compute_bleu(row['h2o_pred'], row['model_pred']), axis=1)

# BLEU scores are stored in the respective Series
# You can access the scores for individual rows as needed
print(f"BLEU score (truth vs. h2o_pred): {bleu_truth_vs_h2o.mean()}")
print(f"BLEU score (truth vs. model_pred): {bleu_truth_vs_model.mean()}")
print(f"BLEU score (h2o_pred vs. model_pred): {bleu_h2o_vs_model.mean()}")

maxjeblick commented 10 months ago

Thanks for the detailed description! I think this recent thread (and corresponding issue) should be relevant. We'll monitor the issue mentioned and implement/support potential fixes.

As your issue is hard to debug remotely, I suggest:

Avoiding left-padding during inference (inference batch size=1) and disable kv cache (add use_cache=Falseargument here). We may add use_cache to the configuration (and UI hence) in the future.
To elobrate on the post: Train a smaller model in fp32 (without LORA if possible), and see if models are better aligned (on short texts, preferably). If so, the discrepancy is likely a result of the rounding errors.

In general, I'd also suggest to use GPT metric in favor of BLEU for incremental model finetuning, as GPT metric is much more aligned with model quality. Otherwise, Perplexity should also be fine to use.

diogobragaswogo commented 10 months ago

That issue does sound like the scenario that is happening. I'll do a round of testing with those suggestions and will get back to here with the results! Thanks for your time and help so far @maxjeblick

diogobragaswogo commented 10 months ago

Again, sorry for the long time between replies, but I have some new findings.

This time, I only tested with vLLM and Hugging Face Basic Transformers Pipeline. What I tried:

Fine-tuning with int8 weights
Fine-tuning with float16 weights
Fine-tuning with float32 weights (required deepspeed)

All of the above still had Lora activated (need to find a way to re-train without lora).

The findings:

vLLM int8

BLEU score (truth vs. h2o_pred): 81.745937661239095
BLEU score (truth vs. model_pred): 70.9285796075405
BLEU score (h2o_pred vs. model_pred): 63.29405626127132

vLLM float16

BLEU score (truth vs. h2o_pred): 82.1443564857712
BLEU score (truth vs. model_pred): 71.02544444704684
BLEU score (h2o_pred vs. model_pred): 63.98552556438229

Hugging Face Basic Transformers Pipeline int8

BLEU score (truth vs. h2o_pred): 86.38065571613316
BLEU score (truth vs. model_pred): 70.19465206463109
BLEU score (h2o_pred vs. model_pred): 72.98347839674122

Hugging Face Basic Transformers Pipeline float16

BLEU score (truth vs. h2o_pred): 84.68455141870595
BLEU score (truth vs. model_pred): 65.59330490822121
BLEU score (h2o_pred vs. model_pred): 70.46537890906082

Based on the above, there's a substantial difference in prediction when one goes from int4 to int8/float16.

Subsequently, I haven't been able to export the float32 model, as it seems it does not fit using GPUs (I had to use deepspeed). Is there a workaround to being able to download the model weights, when it only fits using deepspeed?

I still need to validate with a float32 model and without LORA, but it does seem that the issue referenced in the thread you've proded @maxjeblick could be the root cause.

psinger commented 10 months ago

You can select cpu as device when exporting if it does not fit the GPU.

h2oai / h2o-llmstudio