h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/

[BUG] Cannot Reproduce H2O Prediction Output #450

Closed diogobragaswogo closed 8 months ago

diogobragaswogo commented 11 months ago

🐛 Bug

I trained a model based on circulus/Llama-2-7b-orca-v1 and exported it.

However, I'm currently having trouble reproducing the exact output from the validation predictions CSV. Specifically, the model uses greedy search, and I've tried loading it in the following ways:

  1. Using vLLM
  2. Using HF TGI
  3. Using the sample code of the model card.

vLLM and HF TGI give the same output when running with greedy search, but it differs from the one in the prediction file. The sample code from the model card, also using greedy search, gives yet another output, different from vLLM, HF TGI, and the prediction file (although it comes closer than vLLM and HF TGI).

Considering the above, I'm not sure if I'm loading the model correctly (and passing the intended generation params), or if there is an issue. The prediction params are as follows:

prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 1024
    metric: BLEU
    metric_gpt_model: gpt-3.5-turbo-0301
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.2
    stop_tokens: ''
    temperature: 0.3
    top_k: 0
    top_p: 1.0
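For reference, here is roughly how I am mapping those parameters when loading the exported model with plain transformers. This is a minimal sketch: the model path is a placeholder, the mapping of max_length_inference/min_length_inference to max_new_tokens/min_new_tokens is my assumption, and with do_sample=False the temperature/top_k/top_p values should have no effect.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/exported-model"  # placeholder for the exported checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Example prompt in the format shown on the model card (assumed here)
prompt = "<|prompt|>How are you?</s><|answer|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy search: do_sample=False and num_beams=1; sampling params are ignored
with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=False,
        num_beams=1,
        min_new_tokens=2,     # assumed equivalent of min_length_inference
        max_new_tokens=1024,  # assumed equivalent of max_length_inference
        repetition_penalty=1.2,
    )

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))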

To Reproduce

Train circulus/Llama-2-7b-orca-v1 (or another model) on custom data and check whether the validation prediction outputs can be reproduced using greedy search.

LLM Studio version

maxjeblick commented 11 months ago

Thank you for the detailed description! Would it be possible for you to share the training configuration YAML file?

Regarding the discrepancies between validation predictions and hosted inference using vLLM and HF TGI, one or more of the following could be potential explanations:

Additionally, we've observed discrepancies in validation and chat outputs attributed to mismatches in tokenizer configurations, as detailed here.

diogobragaswogo commented 11 months ago

Hey @maxjeblick

Thank you for your swift response. From my investigation, it may be due to LoRA and quantization (I'm still assessing whether tokenizer configurations come into play). To be sure, here's the training configuration YAML used:

training:
    batch_size: 1
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_best_checkpoint: false
    schedule: Cosine
    train_validation_data: false
    warmup_epochs: 0.0
    weight_decay: 0.0
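(For reference, my reading of the LoRA settings above is that they correspond roughly to the peft configuration below. This is my own mapping for illustration, not LLM Studio's internal code; since lora_target_modules is empty, I'm assuming the default target modules are used.)

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("circulus/Llama-2-7b-orca-v1")

# Rough peft equivalent of lora_r / lora_alpha / lora_dropout above;
# target_modules is left unset so peft falls back to its defaults for Llama
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()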

If it turns out to be due to LoRA/quantization, is there any recommended literature (specific to H2O LLM Studio) on the best way to handle these, so as to minimize differences between training and prediction? My goal is to avoid falling into a rabbit hole of repeatedly training the model to increase the BLEU score, only to have the corresponding BLEU at prediction time vary randomly (it's fine if it doesn't increase as much as in training, as long as it still increases).

maxjeblick commented 11 months ago

Do your predictions differ completely, is one method consistently producing worse results, or do the predictions only start to differ after some words?

"so as to minimize differences between training and prediction?"

Switching to fp16 and an inference batch size of 1 (see the discussion here) should reduce the differences. Apart from that, slight differences in logit output are expected between running inference within the training process, running inference with the (merged) exported model, and using a potentially different implementation (vLLM/TGI). If the predictions are completely different, or consistently worse for one method, there is probably a bug somewhere.
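As a rough illustration of why such differences show up (just a sketch, not the exact code LLM Studio runs; the model path is a placeholder), you can compare the next-token logits of the merged model in fp16 vs. fp32:

import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/exported-model"  # placeholder for the merged export
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("<|prompt|>Example question</s><|answer|>", return_tensors="pt")

def last_token_logits(dtype):
    # Load the merged model in the given precision and return the logits at the last position
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map="auto").eval()
    with torch.no_grad():
        logits = model(**inputs.to(model.device)).logits[0, -1].float().cpu()
    del model
    gc.collect()
    torch.cuda.empty_cache()
    return logits

diff = (last_token_logits(torch.float16) - last_token_logits(torch.float32)).abs().max()
print(f"max abs difference in next-token logits (fp16 vs fp32): {diff.item():.4f}")

If the arg-max token flips at any position because of such small numerical differences, greedy decoding will diverge from that point onwards.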

diogobragaswogo commented 11 months ago

They seem to occur only in a subset, although this subset represents around 20% of the validation prompts. The methods outside of H2O tend to produce worse results than the ones found in the validation file, but not always.

The prompt and output consist of JSON values (among other things), which tend to make the prediction vary past the initial 100 tokens or so.

Considering your previous explanation, I wouldn't say they are so completely different (or worse) that a bug must necessarily exist. However, I'll continue testing with the new information you've given me and will come back with my findings! Thank you for the help so far.

diogobragaswogo commented 10 months ago

@maxjeblick Sorry for the long hiatus - I'm back with some results, and they are... confusing.

I generated outputs for my model's validation set using these three different implementations:

All of the above use generation config parameters chosen to make them as deterministic as possible (i.e. temp=0.1, do_sample=False, etc.), since the validation in LLM Studio used deterministic generation. Here's what I found:

Text Generation Inference

Hugging Face Basic Transformers Pipeline

vLLM

Looking at it from a manual, qualitative angle, some of the results are not that bad, but they are substantially different from H2O's, and this is with all of them running deterministically.
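For completeness, the vLLM setup I'm describing looks roughly like this (the model path is a placeholder and the prompt format is just an example; temperature=0 is vLLM's way of forcing fully greedy decoding, analogous to do_sample=False in transformers, though my exact values varied slightly):

from vllm import LLM, SamplingParams

llm = LLM(model="path/to/exported-model")  # placeholder path

# temperature=0 -> greedy decoding in vLLM
sampling_params = SamplingParams(temperature=0, top_p=1.0, max_tokens=1024)

prompts = ["<|prompt|>Example question</s><|answer|>"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)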

Based on these results, I'm not sure whether there's a bug, but this definitely makes it harder to improve the model's quality in production. For reference, here's the source code used to compute the BLEU scores:

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file into a DataFrame
df = pd.read_csv("./vllm_infer_results.csv")

# Initialize the BLEU smoothing function
smooth = SmoothingFunction().method1

# Function to compute BLEU score
def compute_bleu(reference, candidate):
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smooth)

# Compute BLEU scores for "truth" vs. "h2o_pred"
bleu_truth_vs_h2o = df.apply(lambda row: compute_bleu(row['truth'], row['h2o_pred']), axis=1)

# Compute BLEU scores for "truth" vs. "model_pred"
bleu_truth_vs_model = df.apply(lambda row: compute_bleu(row['truth'], row['model_pred']), axis=1)

# Compute BLEU scores for "h2o_pred" vs. "model_pred"
bleu_h2o_vs_model = df.apply(lambda row: compute_bleu(row['h2o_pred'], row['model_pred']), axis=1)

# BLEU scores are stored in the respective Series
# You can access the scores for individual rows as needed
print(f"BLEU score (truth vs. h2o_pred): {bleu_truth_vs_h2o.mean()}")
print(f"BLEU score (truth vs. model_pred): {bleu_truth_vs_model.mean()}")
print(f"BLEU score (h2o_pred vs. model_pred): {bleu_h2o_vs_model.mean()}")

maxjeblick commented 10 months ago

Thanks for the detailed description! I think this recent thread (and corresponding issue) should be relevant. We'll monitor the issue mentioned and implement/support potential fixes.

As your issue is hard to debug remotely, I suggest:

In general, I'd also suggest using the GPT metric instead of BLEU for incremental model fine-tuning, as the GPT metric is much better aligned with model quality. Otherwise, Perplexity is also fine to use.
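If you want a Perplexity-style check outside of LLM Studio, a minimal sketch along these lines should be enough (this is not LLM Studio's exact metric implementation, and the model path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/exported-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto").eval()

def perplexity(text: str) -> float:
    # Perplexity = exp(mean token-level cross-entropy) of the text under the model
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

print(perplexity("reference answer text"))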

diogobragaswogo commented 10 months ago

That issue does sound like the scenario that is happening here. I'll do a round of testing with those suggestions and will report back with the results! Thanks for your time and help so far @maxjeblick

diogobragaswogo commented 10 months ago

Again, sorry for the long time between replies, but I have some new findings.

This time, I only tested with vLLM and Hugging Face Basic Transformers Pipeline. What I tried:

All of the above still had LoRA activated (I still need to find a way to re-train without LoRA).

The findings:

vLLM int8

vLLM float16

Hugging Face Basic Transformers Pipeline int8

Hugging Face Basic Transformers Pipeline float16

Based on the above, there's a substantial difference in the predictions when going from int4 to int8/float16.
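(For reference, this is roughly how I switch precisions on the transformers side; the path is a placeholder and in practice I load one variant at a time.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_path = "path/to/exported-model"  # placeholder

# float16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto")

# int8 via bitsandbytes
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_path, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto")

# int4 via bitsandbytes
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto")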

Additionally, I haven't been able to export the float32 model, as it does not fit on my GPUs (I had to use DeepSpeed for training). Is there a workaround for downloading the model weights when the model only fits using DeepSpeed?

I still need to validate with a float32 model and without LoRA, but it does seem that the issue referenced in the thread you provided, @maxjeblick, could be the root cause.

psinger commented 10 months ago

You can select cpu as the device when exporting if the model does not fit on the GPU.