h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

Inference model on distributed GPU after model training via CLI #620

Closed: yum-yeom closed this issue 4 months ago

yum-yeom commented 6 months ago

Issue

I am trying to run inference with a model trained via the CLI in LLM Studio on a distributed (multi-GPU) environment, but I am running into some issues.

In order to evaluate the trained model on a validation set, I added the epochs=0 and evaluate_before_training=True options to the distributed_train.sh script and tried to run it. However, I received an error saying that the config.json file does not exist.
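For background, that error comes from the Hugging Face loading path, which expects a config.json next to the weights. A minimal sketch (the path is a placeholder) that triggers the same OSError when pointed at a directory that only contains LLM Studio artifacts such as checkpoint.pth:

```python
# Minimal reproduction sketch (hypothetical path): loading a config from a
# directory that holds only LLM Studio outputs (e.g. checkpoint.pth) fails,
# because transformers looks for a config.json file there.
from transformers import AutoConfig

finetuned_model_path = "/path/to/llmstudio/experiment_output"  # placeholder

# Raises: OSError: ... does not appear to have a file named config.json.
config = AutoConfig.from_pretrained(finetuned_model_path)
```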

Queries

  1. Do I need to manually create a config.json file to work around this situation?
  2. Or is there a separate script like distributed_inference.sh for distributed inference?

To Reproduce

  1. Point the backbone model path in cfg.py to the model trained with the CLI.
  2. Attempt to run distributed_train.sh with the epochs=0 and evaluate_before_training=True settings (see the sketch after this list).
  3. Seeing an error message like the following: OSError: {{finetuned_model_path}} does not appear to have a file named config.json. Checkout 'https://huggingface.co//{{finetuned_model_path}}//main' for available files.
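The settings used in the steps above, written out as rough cfg.py edits (a sketch; the paths are placeholders, and the exact location of evaluate_before_training in the config may differ between versions):

```python
# Sketch of the reproduction settings (placeholders, not a verified config):
cfg.llm_backbone = "/path/to/cli_trained_experiment_output"  # step 1: backbone pointed at the CLI-trained output
cfg.training.epochs = 0                                      # step 2: skip training
cfg.training.evaluate_before_training = True                 # step 2: run evaluation only
```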

Let me know if there is anything else I should share. If you can help, it would be greatly appreciated.

LLM Studio version

v1.3.1

psinger commented 6 months ago

Hi @yum-yeom -

so there are two ways of doing it:

  1. You first push the final model to HF - only then will it be automatically converted to HF format, which you can then specify as a new backbone.
  2. You keep everything as is, change epochs=0 like you did, and specify cfg.architecture.pretrained_weights="path_to_checkpoint.pth".

I believe you rather want option 2) - could you please try it that way and report back whether it works?
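In cfg.py terms, option 2 would look roughly like this (a sketch; the checkpoint path is a placeholder and the backbone stays whatever was used for training):

```python
# Sketch of option 2: keep the original backbone, skip training, and load the
# CLI-trained weights from the LLM Studio checkpoint.
cfg.llm_backbone = "original/hf-backbone"                        # unchanged from the training run
cfg.training.epochs = 0                                          # no further training
cfg.architecture.pretrained_weights = "/path/to/checkpoint.pth"  # weights produced by the CLI run
```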

yum-yeom commented 6 months ago

Hi! First of all, thank you for your answer.

I'm currently unable to upload the model to HF and use it from there, so I tried method 2 and still got the error. It probably doesn't pick up the cfg values I set.

So, I customized the part of the LLM Studio source that loads the Model and Tokenizer, saved the model as a .bin file to a separate path, loaded it from there, and used it with train.py and epochs=0.

With the model saved that way, loading it caused no problems and the performance seems to be reproducible.
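For anyone hitting the same issue, a rough sketch of that kind of workaround (all names, paths, and the checkpoint key layout are assumptions): export the fine-tuned weights into a standard Hugging Face directory (config.json plus weight files), then point the backbone path at that directory for an epochs=0 run.

```python
# Hypothetical sketch: load the CLI-trained weights into the backbone and
# re-save them in Hugging Face format (config.json + weight files), so the
# exported directory can be used as a backbone for an epochs=0 evaluation run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_name = "original/hf-backbone"        # backbone used for training (placeholder)
checkpoint_path = "/path/to/checkpoint.pth"   # LLM Studio checkpoint (placeholder)
export_path = "/path/to/exported_model"       # target directory for HF-format files

model = AutoModelForCausalLM.from_pretrained(backbone_name)
tokenizer = AutoTokenizer.from_pretrained(backbone_name)

state_dict = torch.load(checkpoint_path, map_location="cpu")
# The checkpoint may wrap the weights (e.g. under a "model" key) and prefix the
# backbone parameters (e.g. "backbone."); adjust this to the actual layout.
if isinstance(state_dict, dict) and "model" in state_dict:
    state_dict = state_dict["model"]
state_dict = {
    (k[len("backbone."):] if k.startswith("backbone.") else k): v
    for k, v in state_dict.items()
}
model.load_state_dict(state_dict, strict=False)

model.save_pretrained(export_path)       # writes config.json and the weight files
tokenizer.save_pretrained(export_path)
```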

psinger commented 4 months ago

Please re-open in case there are still open issues.