huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

Missing config.json file after training using AutoTrain #299

Closed KabaTubare closed 10 months ago

KabaTubare commented 1 year ago

Context:

Environment: Google Colab (Pro Version using a V100) for training.
Tool: Utilizing Hugging Face AutoTrain for fine-tuning a language model.

Sequence of Events:

Initial Training: Successfully trained a model using AutoTrain. Process seemingly completed without errors, resulting in several output files.

Missing config.json: Despite successful training, noticed that the config.json file was not generated. Without config.json, the trained model cannot be loaded for inference or further training.

Manual Configuration: Created a config.json manually, based on the base model used for fine-tuning (NousResearch/Llama-2-7b-chat-hf) plus additional training and adapter parameters derived from the fine-tuned model's files AutoTrain uploads to the HF repository. Uploaded this config.json to the Hugging Face repository where the model resides.

Upload to Repository: Uploaded all relevant files, including pytorch_model.bin, adapter_config.json, adapter_model.bin, and others, to a Hugging Face repository named Kabatubare/meta_douglas_2.

Model Loading Error: Attempted to load the model and encountered the following error:

OSError: Kabatubare/meta_douglas_2 does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt, or flax_model.msgpack.

File Size Anomaly: Noticed that the uploaded pytorch_model.bin is only 888 bytes, which is far smaller than is typical for such files.

Repository File Structure:

adapter_config.json
adapter_model.bin
added_tokens.json
config.json (manually added)
pytorch_model.bin (888 bytes, suspected to be incorrect or incomplete)
Tokenizer files (tokenizer.json, tokenizer.model, etc.)
Training parameters (training_args.bin, training_params.json)

Specific Questions for the Hugging Face / GitHub Community:

Configuration File: Why is a config.json not generated by AutoTrain by default? Is there a specific setting or flag that needs to be enabled to output this file?

File Size Issue: What could cause pytorch_model.bin to be so small (888 bytes)? Could this be a symptom of an incomplete or failed save operation?

Manual Configuration: Are there standard procedures or checks to verify that a manually created config.json is accurate? Are there tools to validate the config.json against the actual PyTorch model file?

Error Resolution: How can the OSError encountered while loading the model be resolved? Are there specific requirements for the directory structure when loading models from a Hugging Face repository?

Model Integrity: Given the missing config.json and the small size of pytorch_model.bin, are there steps to verify the integrity of the trained model?

abhishekkrthakur commented 1 year ago

PEFT models are adapter-only models; they don't have a config.json. You need to merge the adapter into the base model manually after training, or use the --merge-adapter argument while training.
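For reference, if you want to use the adapter repo as-is (without merging), a minimal sketch along these lines should work, assuming the repo still contains the adapter_config.json / adapter_model.bin that AutoTrain produced (repo and base-model names are taken from this thread):

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Adapter-only repo produced by AutoTrain (name taken from this thread).
adapter_repo = "Kabatubare/meta_douglas_2"

# AutoPeftModelForCausalLM reads adapter_config.json, downloads the base model it
# points to (NousResearch/Llama-2-7b-chat-hf here) and attaches the adapter on top,
# so no config.json is needed in the adapter repo itself.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_repo,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))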

abhishekkrthakur commented 1 year ago

you can also use this space to merge adapters: https://huggingface.co/spaces/autotrain-projects/llm-merge-adapter

ruvilonix commented 1 year ago

Is there a way to use the --merge-adapter argument when using the Docker space?

ebayes commented 11 months ago

Hi! I've also been having a similar issue running inference with a finetuned model. I used the no-code version of AutoTrain to finetune Mistral 7B (specifically this sharded version: https://huggingface.co/alexsherstinsky/Mistral-7B-v0.1-sharded) on a custom dataset in the correct format, and I've hosted it here: https://huggingface.co/seyabde/mistral_7b_yo_instruct. I've used Abhishek's space to merge adapters (https://huggingface.co/spaces/autotrain-projects/llm-merge-adapter), but it's not performing as expected. Any ideas why?

Perhaps it could be because my finetuned model is saved in a checkpoint folder (here: https://huggingface.co/seyabde/mistral_7b_yo_instruct/tree/main/checkpoint-32028). Does this mean I should run the adapter merge on the contents of the checkpoint folder instead? I spent quite a bit on the training run, so I'm trying to avoid retraining! Any help much appreciated. Thanks!

abhishekkrthakur commented 11 months ago

@ebayes what doesn't perform as expected? Is there an error? Can you post a screenshot of the trained model files?

ebayes commented 11 months ago

I've launched it on an Inference Endpoint. There it doesn't return an error, but some prompts don't return any output, i.e. [{'generated_text': ''}]. Because my AutoTrain run was saved in a checkpoint folder, I'm not sure whether, after I merged the adapters, it is inferencing the base model instead of my finetuned version.

Screenshot of trained model files:

1. Main folder (after merging the adapters): [Screenshot 2023-12-18 at 9:04:02 PM]

2. checkpoint-32028 folder: [Screenshot 2023-12-18 at 9:04:18 PM]
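For reference, a minimal sketch of merging directly from the checkpoint folder (assuming adapter_config.json and adapter_model.bin live inside checkpoint-32028 and that the sharded base model above was used for training; the output directory name is only illustrative):

import os

import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the whole fine-tuned repo locally, then point peft at the checkpoint
# subfolder that actually contains the adapter files.
local_dir = snapshot_download("seyabde/mistral_7b_yo_instruct")
adapter_dir = os.path.join(local_dir, "checkpoint-32028")

base = AutoModelForCausalLM.from_pretrained(
    "alexsherstinsky/Mistral-7B-v0.1-sharded",  # base model used for fine-tuning (from this thread)
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
merged.save_pretrained("mistral_7b_yo_instruct_merged")

# Tokenizer files may live in the repo root rather than the checkpoint folder;
# fall back to local_dir if this raises.
AutoTokenizer.from_pretrained(adapter_dir).save_pretrained("mistral_7b_yo_instruct_merged")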

ebayes commented 11 months ago

Another reason it might not return any output is that it's finetuned on Yoruba instruction-following demonstrations (not English), so if Mistral's training dataset doesn't contain much Yoruba, it might just produce gobbledygook. If this is the case, I am going to finetune again: first extend pre-training with a large unstructured Yoruba corpus, then finetune that model. If I were to do this, would I select the SFT / Generic Trainer? And what format should the data be in? The docs only provide guidance for instruction-following models: https://huggingface.co/docs/autotrain/llm_finetuning
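For the unstructured-corpus case, a minimal sketch of the data layout (assuming the trainer's default single "text" column, which is what the linked docs describe for LLM fine-tuning; the corpus entries here are placeholders):

import pandas as pd

# Placeholder documents standing in for a large unstructured Yoruba corpus,
# one document (or chunk) per row.
corpus = [
    "first Yoruba document goes here ...",
    "second Yoruba document goes here ...",
]

# AutoTrain's LLM trainers read a single "text" column by default, so plain
# continued pre-training data can simply be written into that column.
pd.DataFrame({"text": corpus}).to_csv("train.csv", index=False)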

vishalmysore commented 11 months ago

Please let me know once you have a solution, since I am in the process of finetuning Mistral 7B and want to avoid finetuning again!

xihajun commented 10 months ago

you can also use this space to merge adapters: huggingface.co/spaces/autotrain-projects/llm-merge-adapter

We can also use this space locally with docker (testing it)

docker run --gpus all -it -p 7860:7860 --platform=linux/amd64 \
    registry.hf.space/autotrain-projects-llm-merge-adapter:latest python app.py

But it looks like no .bin files are generated by the HF space.

xihajun commented 10 months ago

Does anyone have an idea why the merged model's accuracy is so different from the unmerged one?

Here is the script I used for merging manually

import logging

import torch
from peft import PeftModel
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

adapter_path = "xxx"  # path to the trained adapter
base_model_path = "mistralai/Mistral-7B-v0.1"
target_model_path = "model_output/"

config = AutoConfig.from_pretrained(base_model_path)

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(model, adapter_path)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_path,
    trust_remote_code=True,
)
model = model.merge_and_unload()

logger.info("Saving target model...")
model.save_pretrained(target_model_path)
tokenizer.save_pretrained(target_model_path)
config.save_pretrained(target_model_path)
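As a quick sanity check (a minimal sketch reusing the paths above; the prompt template is a placeholder and should match whatever format was used during training), the merged output can be reloaded and queried directly:

from transformers import AutoModelForCausalLM, AutoTokenizer

merged = AutoModelForCausalLM.from_pretrained(
    target_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained(target_model_path)

prompt = "### Human: ...\n### Assistant:"  # placeholder; reuse the training prompt format
inputs = tok(prompt, return_tensors="pt").to(merged.device)
output = merged.generate(**inputs, max_new_tokens=64)
print(tok.decode(output[0], skip_special_tokens=True))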

abhishekkrthakur commented 10 months ago

@xihajun how different is it? Could you please create a new issue with more details on how you merge and how you calculate accuracy for the merged and unmerged models, and we will investigate?

KabaTubare commented 10 months ago

Thank you so much for getting back. At this point I've moved past this, as there are easier ways, especially now, to get this work done.

Kind regards,

Troy Woodson

xihajun commented 10 months ago

sure, I will do that, thanks @abhishekkrthakur

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] commented 10 months ago

This issue was closed because it has been inactive for 2 days since being marked as stale.

SrushtiAckno commented 8 months ago

PEFT models are adapter-only models; they don't have a config.json. You need to merge the adapter into the base model manually after training, or use the --merge-adapter argument while training.

How do we specify the --merge-adapter parameter in the Colab notebook? What should the argument passed be?
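For reference, one way to pass it from a notebook cell is to invoke the AutoTrain CLI directly; a rough sketch (the --merge-adapter flag is the one discussed above, the model/data/project values are placeholders, and exact flag spellings can vary across AutoTrain versions):

import subprocess

# Rough sketch of a training command with adapter merging enabled; adjust the
# model, data path, and project name for your own run.
subprocess.run(
    [
        "autotrain", "llm",
        "--train",
        "--model", "mistralai/Mistral-7B-v0.1",
        "--data-path", "data/",
        "--project-name", "my-finetune",
        "--merge-adapter",
    ],
    check=True,
)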

ahmed8047762 commented 6 months ago

@abhishekkrthakur I finetuned the HuggingFaceH4/zephyr-7b-beta model on a custom dataset using AutoTrain and tried to merge it with the base model using a duplicate of your space, but after all 8/8 model shards were loaded, the following error occurred:

TypeError: LoraConfig.__init__() got an unexpected keyword argument 'layer_replication'

Any idea how to resolve this issue?
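One thing worth checking (an assumption, not a confirmed diagnosis): layer_replication is a relatively recent LoraConfig field, so this error usually means the adapter_config.json was written by a newer peft than the one running the merge. A minimal sketch to check whether the installed peft accepts the argument:

import inspect

import peft
from peft import LoraConfig

print("peft version:", peft.__version__)

# If 'layer_replication' is not a LoraConfig parameter here, the peft in the merge
# environment is older than the one that wrote adapter_config.json; upgrading peft
# in that environment should let the config load again.
print("layer_replication" in inspect.signature(LoraConfig.__init__).parameters)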