h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

Inference model on distributed GPU after model training via CLI #620

Closed: yum-yeom closed this issue 4 months ago

yum-yeom commented 6 months ago

Issue

I am trying to run inference with a model trained via the CLI in LLM Studio on a distributed (multi-GPU) environment, but I am running into some issues.

In order to evaluate the trained model on a validation set, I added the epochs=0 and evaluate_before_training=True options to the distributed_train.sh script and tried to run it. However, I received an error saying that the config.json file does not exist.
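For background, that error comes from the Hugging Face loading path, which expects a config.json next to the weights. A minimal sketch (the path is a placeholder) that triggers the same OSError when pointed at a directory that only contains LLM Studio artifacts such as checkpoint.pth:

```python
# Minimal reproduction sketch (hypothetical path): loading a config from a
# directory that holds only LLM Studio outputs (e.g. checkpoint.pth) fails,
# because transformers looks for a config.json file there.
from transformers import AutoConfig

finetuned_model_path = "/path/to/llmstudio/experiment_output"  # placeholder

# Raises: OSError: ... does not appear to have a file named config.json.
config = AutoConfig.from_pretrained(finetuned_model_path)
```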

Queries

  1. Do I need to manually create a config.json file to work around this situation?
  2. Or is there a separate script like distributed_inference.sh for distributed inference?

To Reproduce

  1. Point the backbone model path in cfg.py to the model trained with the CLI.
  2. Attempt to run distributed_train.sh with the epochs=0 and evaluate_before_training=True settings (see the sketch after this list).
  3. Seeing an error message like the following: OSError: {{finetuned_model_path}} does not appear to have a file named config.json. Checkout 'https://huggingface.co//{{finetuned_model_path}}//main' for available files.
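The settings used in the steps above, written out as rough cfg.py edits (a sketch; the paths are placeholders, and the exact location of evaluate_before_training in the config may differ between versions):

```python
# Sketch of the reproduction settings (placeholders, not a verified config):
cfg.llm_backbone = "/path/to/cli_trained_experiment_output"  # step 1: backbone pointed at the CLI-trained output
cfg.training.epochs = 0                                      # step 2: skip training
cfg.training.evaluate_before_training = True                 # step 2: run evaluation only
```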

Let me know if there is anything else I should share. If you can help, it would be greatly appreciated.

LLM Studio version

v1.3.1

psinger commented 6 months ago

Hi @yum-yeom -

so there are two ways of doing it:

  1. You first push the final model to HF - only then will it be automatically converted to HF format, which you can then specify as a new backbone.
  2. You keep everything as is, change epochs=0 like you did, and specify cfg.architecture.pretrained_weights="path_to_checkpoint.pth".

I believe you rather want option 2) - could you please try it that way and report back whether it works?
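In cfg.py terms, option 2 would look roughly like this (a sketch; the checkpoint path is a placeholder and the backbone stays whatever was used for training):

```python
# Sketch of option 2: keep the original backbone, skip training, and load the
# CLI-trained weights from the LLM Studio checkpoint.
cfg.llm_backbone = "original/hf-backbone"                        # unchanged from the training run
cfg.training.epochs = 0                                          # no further training
cfg.architecture.pretrained_weights = "/path/to/checkpoint.pth"  # weights produced by the CLI run
```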

yum-yeom commented 6 months ago

Hi! First of all, thank you for your answer.

I'm currently unable to upload the model to HF and use it from there, so I tried method 2 and still got the error. It probably doesn't pick up the cfg values I set.

So, I customized the part of the LLM Studio source that loads the Model and Tokenizer, saved the model as a .bin file to a separate path, loaded it from there, and used it with train.py and epochs=0.

With the model saved that way, loading it caused no problems and the performance seems to be reproducible.
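For anyone hitting the same issue, a rough sketch of that kind of workaround (all names, paths, and the checkpoint key layout are assumptions): export the fine-tuned weights into a standard Hugging Face directory (config.json plus weight files), then point the backbone path at that directory for an epochs=0 run.

```python
# Hypothetical sketch: load the CLI-trained weights into the backbone and
# re-save them in Hugging Face format (config.json + weight files), so the
# exported directory can be used as a backbone for an epochs=0 evaluation run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_name = "original/hf-backbone"        # backbone used for training (placeholder)
checkpoint_path = "/path/to/checkpoint.pth"   # LLM Studio checkpoint (placeholder)
export_path = "/path/to/exported_model"       # target directory for HF-format files

model = AutoModelForCausalLM.from_pretrained(backbone_name)
tokenizer = AutoTokenizer.from_pretrained(backbone_name)

state_dict = torch.load(checkpoint_path, map_location="cpu")
# The checkpoint may wrap the weights (e.g. under a "model" key) and prefix the
# backbone parameters (e.g. "backbone."); adjust this to the actual layout.
if isinstance(state_dict, dict) and "model" in state_dict:
    state_dict = state_dict["model"]
state_dict = {
    (k[len("backbone."):] if k.startswith("backbone.") else k): v
    for k, v in state_dict.items()
}
model.load_state_dict(state_dict, strict=False)

model.save_pretrained(export_path)       # writes config.json and the weight files
tokenizer.save_pretrained(export_path)
```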

psinger commented 4 months ago

Please re-open in case there are still open issues.