ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Implement batch size tuning for None type LLM trainer (used for batch inference) #3525

arnavgarg1 opened this issue 1 year ago (status: Open)

arnavgarg1 commented 1 year ago

Currently, batch size tuning for the None-type LLM trainer just returns a minimum batch size of 1. We want to be able to increase the batch size as much as possible before we OOM; tuning only requires forward passes.

alexsherstinsky commented 11 months ago

take

alexsherstinsky commented 11 months ago

self-assign

arnavgarg1 commented 11 months ago

Questions from @alexsherstinsky on the Ludwig Slack ("Here are my questions and assumptions"):

  1. Is the idea to maximize the batch_size for the "forward" path when the model is being trained (i.e., as part of fine-tuning), or only in the predict() part (i.e., purely as part of inference — “zero grad” — in which case we are not using base_model, but setting config["adapter"]["pretrained_adapter_weights"] to the already fine-tuned model name)?
  2. If this is the former, meaning that we are dealing with the fine-tuning case, then does that mean the eval_batch_size should accept the values of auto and None in the FineTuneTrainerConfig? If it is the latter, then the same question would be for LLMTrainerConfig (since NoneTrainerConfig extends LLMTrainerConfig, and the actual functionality would go into NoneTrainer). Or a different configuration class? By experimenting with the ECD test, test_trainer.py::test_tune_batch_size_and_lr, I hypothesized that for the LLM case, it would involve LLMTrainerConfig or NoneTrainerConfig; however, initial experiments point to FineTuneTrainerConfig, because that is how the configuration is parsed (i.e., eval_batch_size is validated in FineTuneTrainerConfig).
  3. Any additional background would be tremendously appreciated.
arnavgarg1 commented 11 months ago

Is the idea to maximize the batch_size for the "forward" path when the model is being trained (i.e., as part of fine-tuning), or only in the predict() part (i.e., purely as part of inference — “zero grad” — in which case we are not using base_model, but setting config["adapter"]["pretrained_adapter_weights"] to the already fine-tuned model name)?

The NoneType LLM Trainer class is used for zero-shot/few-shot inference! So this would be maximizing the batch_size for the model.train() function call where the trainer type is set to none, something like this:

trainer:
   type: none

In this case, we would either use the pretrained/fine-tuned base model, or the pretrained base model + adapter to do the forward passes and maximize batch size.

If this is the former, meaning that we are dealing with the fine-tuning case, then does that mean the eval_batch_size should accept the values of auto and None in the FineTuneTrainerConfig? If it is the latter, then the same question would be for LLMTrainerConfig (since NoneTrainerConfig extends LLMTrainerConfig, and the actual functionality would go into NoneTrainer). Or a different configuration class? By experimenting with the ECD test, test_trainer.py::test_tune_batch_size_and_lr, I hypothesized that for the LLM case, it would involve LLMTrainerConfig or NoneTrainerConfig; however, initial experiments point to FineTuneTrainerConfig, because that is how the configuration is parsed (i.e., eval_batch_size is validated in FineTuneTrainerConfig).

Since this is the latter case, we'd basically be able to set batch_size: auto as part of the trainer config:

trainer:
    type: none
    batch_size: auto

which would try batch sizes from the smallest possible value (1) all the way up to the epoch size and pick the largest batch size that fits into memory without running into CPU/GPU OOMs (we're mostly interested in GPU OOMs here). The actual implementation will go into the NoneTrainer here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/trainers/trainer_llm.py#L217. Since the model is already initialized in memory by the time this function is called in the trainer, tuning just requires doing forward passes until we OOM.
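
For illustration, here is a minimal sketch of that search, doubling the batch size until the first OOM; probe_max_batch_size and forward_fn are hypothetical names used only for this sketch, not existing Ludwig APIs:

from typing import Callable, Optional

import torch


def probe_max_batch_size(
    forward_fn: Callable[[int], None], start: int = 1, upper_bound: Optional[int] = None
) -> int:
    """Double the batch size until forward_fn raises a CUDA OOM (or until
    upper_bound, e.g. the epoch size) and return the largest size that worked.
    forward_fn(batch_size) is assumed to run one full forward pass at that size."""
    best = start
    batch_size = start
    while upper_bound is None or batch_size <= upper_bound:
        try:
            with torch.no_grad():  # inference only, so no gradient memory is kept
                forward_fn(batch_size)
            best = batch_size
            batch_size *= 2  # grow exponentially until we hit the memory ceiling
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()  # release the failed allocation before falling back
            break
    return best

In the NoneTrainer, forward_fn could wrap the same generation call the evaluation path already performs, and the returned value would replace the hard-coded 1.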

Any additional background would be tremendously appreciated. I think there are two things we want to do here:

  1. Call forward passes with the torch.no_grad() context manager so we don't have to worry about memory from gradients, which will effectively allow for larger batch sizes.
  2. We want to create synthetic data where the number of generated tokens (max_new_tokens) equals the max_sequence_length of the output feature, the global_max_sequence_length (whichever is specified), or some fallback default (say 256), and then pass in increasing batch sizes from 1 to the epoch length, with every sample in the batch generating the maximum possible number of tokens. The reason is that we want to tune the batch size for the worst-case scenario, where each sample produces all max_new_tokens tokens. Since traditional attention computation requires memory that grows quadratically with sequence length, we want the largest batch size whose forward passes survive this worst-case max_new_tokens. These models are autoregressive, so we are not guaranteed the worst case, but I think it is possible to force it through the generation config parameters (https://github.com/ludwig-ai/ludwig/blob/master/ludwig/schema/llms/generation.py#L10) by setting min_new_tokens to the max_sequence_length of the output feature, the global_max_sequence_length, or some fallback value like 256 tokens (see the sketch after this list).
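
As a rough sketch of point 2, assuming a Hugging Face causal LM (model, tokenizer, input_length, and the 256-token fallback below are illustrative placeholders rather than Ludwig internals), the worst case can be forced by pinning min_new_tokens to max_new_tokens:

import torch
from transformers import GenerationConfig

FALLBACK_MAX_NEW_TOKENS = 256  # fallback when no max_sequence_length / global_max_sequence_length is given


def worst_case_generate(model, tokenizer, batch_size: int, input_length: int,
                        max_new_tokens: int = FALLBACK_MAX_NEW_TOKENS):
    """Run one generate() pass on a synthetic batch that decodes the full
    max_new_tokens for every sample, approximating peak inference memory."""
    # Synthetic inputs: arbitrary token ids at the maximum input length.
    input_ids = torch.randint(
        low=0, high=tokenizer.vocab_size, size=(batch_size, input_length), device=model.device
    )
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        min_new_tokens=max_new_tokens,  # disallow early stopping so every sample hits the worst case
        do_sample=False,
    )
    with torch.no_grad():
        return model.generate(input_ids=input_ids, generation_config=generation_config)

A callable like this could serve as the forward_fn probed by the batch-size search above, so that the tuned value reflects worst-case decoding rather than early-stopped generations.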

Let me know if this helps! Happy to answer more questions

alexsherstinsky commented 11 months ago

The following is the analysis of the situation (conducted by @arnavgarg1 and @alexsherstinsky) with ideas for future work.

TL;DR

Pause the work on this feature for the time being; further product-level and design discussions are required to proceed.

Details

Analysis

The "inconvenience" of zero-shot/few-shot inference requiring model.train() to be executed right before model.predict() is the peculiarity that has to do with the generation of metadata, available only from executing the model.train() call.

Hence, we have to "pseudo-train" the model in order to do zero-shot/few-shot inference. To do that, we configure the trainer type to be "none"; then, when LudwigModel.train() executes, LudwigModel._tune_batch_size() is called. Executing LudwigModel._tune_batch_size() updates LudwigModel.config_obj.trainer.batch_size = tuned_batch_size in memory. Internally, it calls the trainer's tune_batch_size() method. For ECD models, this only does anything if batch_size is set to "auto" or eval_batch_size is set to "auto" / None; otherwise it is skipped, since those batch sizes were explicitly set by the user. For the NoneTrainer, we currently just return 1 (https://github.com/ludwig-ai/ludwig/blob/master/ludwig/trainers/trainer_llm.py#L217).

The "pseudo-train" is really just calling model.evaluate() under the hood for each of the three data sets, which computes a set of "metrics". Essentially, one needs to make this call to model.train() -- even though no training is happening -- so as to be able to then perform model.predict() calls using the pre-trained LLMs (otherwise, Ludwig raises an error that the model "has not been trained").

However, in the current implementation, this optimal batch_size will not be available to model.predict() (because model.predict() accepts batch_size as an argument with 128 as the default). This implies that the batch_size optimization as it stands today, without some architectural changes, is designed solely for speeding up this (eventually obsolete) model.train() call. Moreover, when the trainer type is configured to be "none" and inference is performed on the freshly loaded fine-tuned LLM, the model.train() call executes extremely fast (after Ludwig loads the models from storage). Here is an example configuration and driver code from a Google Colab notebook illustrating the experiments that were conducted:

import logging

import pandas as pd

from ludwig.api import LudwigModel

trained_config: dict = {
    "model_type": "llm",
    "base_model": "alexsherstinsky/Mistral-7B-v0.1-sharded",
    "input_features": [
        {
            "name": "dialogue",
            "type": "text",
            "preprocessing": {"max_sequence_length": 1024},
        }
    ],
    "output_features": [
        {
            "name": "summary",
            "type": "text",
            "preprocessing": {"max_sequence_length": 384},
        }
    ],
    "prompt": {
        "template": "Summarize this dialogue:\n### Dialogue: {dialogue}\n### Synopsis:"
    },
    "generation": {"temperature": 0.1, "max_new_tokens": 512},
    "adapter": {
        "type": "lora",
        "pretrained_adapter_weights": "alexsherstinsky/mistralai-7B-v01-based-finetuned-using-ludwig-with-samsum-A100-sharded-8bit-merged",
    },
    "quantization": {"bits": 8},
    "preprocessing": {"split": {"type": "fixed"}},
    "trainer": {"type": "none"},
}

samsum_test_dialogue: str = """
A: Hi Tom, are you busy tomorrow's afternoon?
B: I'm pretty sure I am. What's up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we've discussed it many times. I think he's ready now.
B: That's good. Raising a dog is a tough issue. Like having a baby ;-)
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he'd name it after his dead hamster - Lemmy - he's a great Motorhead fan :-)))
"""

# Control dataset: the same dialogue repeated three times.
df_predict_control_example: pd.DataFrame = pd.DataFrame(
    data={
      "split": [0, 0, 0,],
      "dialogue": [samsum_test_dialogue, samsum_test_dialogue, samsum_test_dialogue,],
      "summary": ["inference", "inference", "inference",],
    }
)

# "Pseudo-train" (trainer type "none"), then predict with the loaded base model + adapter.
model: LudwigModel = LudwigModel(config=trained_config, logging_level=logging.INFO)
results = model.train(dataset=df_predict_control_example)
predictions = model.predict(dataset=df_predict_control_example.iloc[:1])

So, the fundamental question is: for which operation are we optimizing batch_size?

We do not yet have a definitive answer and would need to take a step back in order to better understand where this batch_size optimization fits best: batch prediction, batch evaluation, or something else.

Preliminary Considerations for Future Work

The eval_batch_size will get set in the model_config in the LudwigModel class and re-used at model.evaluate() time as well as during model.train() for the zero-shot/few-shot model. It is true, though, that we have hardcoded the batch_size for the predict method, most likely because tuning the batch_size just to do a model.predict() call is deemed too costly. That being said, one could make the case that if the model is an LLM, then we would override the default batch_size of 128 with the tuned eval_batch_size prior to creating the predictor and calling predictor.batch_predict(). This can be implemented in a similar way to the mechanism in model.evaluate(), but only for LLM models (https://github.com/ludwig-ai/ludwig/blob/master/ludwig/api.py#L1076). That way, one pays the cost at "pseudo-training" time, with the benefit of then being able to use the best batch_size at inference time.
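
A hedged sketch of that override follows; the function below and the attribute access it performs are simplifications assumed for illustration, not the actual api.py code:

def resolve_predict_batch_size(ludwig_model, requested_batch_size: int = 128) -> int:
    """Prefer a tuned eval_batch_size over the hard-coded default of 128,
    but only for LLM models and only when a concrete tuned value exists."""
    tuned = getattr(ludwig_model.config_obj.trainer, "eval_batch_size", None)
    if getattr(ludwig_model.config_obj, "model_type", None) == "llm" and isinstance(tuned, int) and tuned > 0:
        return tuned
    return requested_batch_size

The resolved value would then be handed to predictor.batch_predict(), mirroring the mechanism that model.evaluate() already uses.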

We can potentially implement this batch_size tuning logic for the FineTuneTrainer, whereby if batch_size is "auto" or eval_batch_size is "auto", then we override the implementation in the BaseTrainer class and do the batch_size optimization instead. This could be useful because the batch_size for training and for evaluation will indeed be different, since training passes (which keep activations and gradients around) are much more memory-intensive than evaluation forward passes.
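
A rough sketch of that split, assuming a probe helper like the one sketched earlier; the function and parameter names here are hypothetical and only illustrate that the two batch sizes would be searched with different step functions:

def tune_train_and_eval_batch_sizes(trainer_config, run_train_step, run_eval_step, probe):
    """Tune training and evaluation batch sizes independently.
    run_train_step(batch_size) performs forward + backward + optimizer step,
    run_eval_step(batch_size) performs a forward pass only, and probe is a
    search helper such as probe_max_batch_size above (minus the no_grad
    wrapper for the training case)."""
    if trainer_config.batch_size == "auto":
        # Training steps keep activations and gradients alive, so this value ends up smaller.
        trainer_config.batch_size = probe(run_train_step)
    if trainer_config.eval_batch_size in ("auto", None):
        # Evaluation only needs forward passes, so a larger batch size usually fits.
        trainer_config.eval_batch_size = probe(run_eval_step)
    return trainer_config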