run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.35k stars 4.98k forks

[Question]: How to use the finetuned embedding model? #8780

Closed: Aekansh-Ak closed this issue 6 months ago

Aekansh-Ak commented 10 months ago


Question

I was able to finetune the model using this documentation:

https://gpt-index.readthedocs.io/en/stable/examples/finetuning/embeddings/finetune_embedding.html

It created a folder "model_output_test" containing these files: config.json, pytorch_model.bin

The original embedding model downloaded from Hugging Face has these files: config_sentence_transformers.json, pytorch_model.bin, sentence_bert_config.json, tokenizer_config.json, vocab.txt, config.json, modules.json, README.md, special_tokens_map.json, tokenizer.json

I am currently using the embedding model like this:

embed_model = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-large-en-v1.5", model_kwargs={"device": "cpu"})

How do I use the one finetuned on my data, since it is in a folder? Also, is it okay for it to have just those two files?

dosubot[bot] commented 10 months ago

🤖

Hello @Aekansh-Ak,

I'm a bot here to assist you while waiting for a human maintainer. I can help solve bugs, answer questions, and even provide guidance on how to become a contributor. Rest assured, I'm already working on your issue regarding the finetuned embedding model. I'll provide a full answer shortly. Thanks for reaching out!

dosubot[bot] commented 10 months ago

🤖

To use the finetuned embedding model from the "model_output_test" folder in LlamaIndex, provide the path to that folder as the model_name parameter when initializing the HuggingFaceEmbedding class.

The HuggingFaceEmbedding class uses the AutoModel.from_pretrained and AutoTokenizer.from_pretrained methods from the HuggingFace transformers library to load the model and tokenizer. These methods can accept a local file path to a directory containing the model files (i.e., "config.json" and "pytorch_model.bin") as the pretrained_model_name_or_path parameter.

Here is an example of how you can do this:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Path to the directory containing the model files
model_path = "model_output_test"

# Initialize the HuggingFaceEmbedding class with the model path
embedding = HuggingFaceEmbedding(model_name=model_path)

In this example, model_path is the path to the "model_output_test" folder. The HuggingFaceEmbedding class will automatically load the model and tokenizer from this directory.

Regarding your second question, it is okay for the finetuned model to just have the "config.json" and "pytorch_model.bin" files. The "config.json" file contains the configuration of the model, and the "pytorch_model.bin" file contains the weights of the model. These are the essential files needed to load the model. The other files in the original model directory are not strictly necessary for loading the model, but they may provide additional functionality such as tokenization.

Please note that the HuggingFaceEmbedding class will attempt to load a tokenizer using the AutoTokenizer.from_pretrained method. If the tokenizer files are not present in the "model_output_test" directory, it will fall back to the default tokenizer for the model architecture specified in the "config.json" file. If you have a specific tokenizer that you want to use with the model, you should include the tokenizer files in the "model_output_test" directory.
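If you do want the finetuned folder to carry its own tokenizer, a minimal sketch (assuming both models are available locally; the directory paths below are placeholders) is to copy the tokenizer files over from the base model's directory:

```python
import shutil
from pathlib import Path

# Placeholder paths -- point these at your local copies
base_model_dir = Path("bge-large-en-v1.5")   # original model directory
finetuned_dir = Path("model_output_test")    # finetuned output directory

# Tokenizer-related files the finetuning run may not have written
tokenizer_files = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
]

# Copy each tokenizer file that exists in the base model directory
for name in tokenizer_files:
    src = base_model_dir / name
    if src.exists():
        shutil.copy(src, finetuned_dir / name)
```

After copying, loading the tokenizer from "model_output_test" should pick up the bundled files instead of falling back to a default.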

I hope this helps! Let me know if you have any other questions.



This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich commented 10 months ago

If the model is in a folder, you should be able to just set model_name to the path of that folder.

Aekansh-Ak commented 10 months ago

Error: ValueError: Unrecognized model in /home/seceon/dalm/model_output_test. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, camembert, canine, chinese_clip, clap, clip, clipseg, codegen, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, data2vec-audio......

This is all that is in model_output_test:

{"in_features": 1024, "out_features": 1024, "bias": false}

PhiBrandon commented 10 months ago

> Error: ValueError: Unrecognized model in /home/seceon/dalm/model_output_test. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, camembert, canine, chinese_clip, clap, clip, clipseg, codegen, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, data2vec-audio......
>
> This is all that is in model_output_test:
>
> {"in_features": 1024, "out_features": 1024, "bias": false}

Were you able to resolve this? If you navigate to the BAAI/bge-large-en-v1.5 model card and look at its config.json, line 22 declares "model_type": "bert". So, adding that key to your config file may resolve this.
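A minimal sketch of that fix (the folder path is a placeholder; "bert" matches the base model's upstream config):

```python
import json
from pathlib import Path

# Placeholder path to the finetuned output folder
config_path = Path("model_output_test") / "config.json"

if config_path.exists():
    config = json.loads(config_path.read_text())
    # bge-large-en-v1.5 is BERT-based, per its upstream config.json
    config.setdefault("model_type", "bert")
    config_path.write_text(json.dumps(config, indent=2))
```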

Also, according to the notebook you linked, it seems like it references the newly finetuned model as local:test_model. Is that true in your case as well?

This comment just serves as a potential starting point, in case neither of the above has been attempted.

dosubot[bot] commented 7 months ago

Hi, @Aekansh-Ak,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you were seeking guidance on using a fine-tuned embedding model in your code. Dosubot provided detailed guidance on using the model with the HuggingFaceEmbedding class, and other users offered assistance to resolve an error related to an unrecognized model type in the config.json file. The issue has been resolved with the provided guidance and assistance.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LlamaIndex project. If you have any further questions or issues, feel free to reach out.

GEN200J commented 6 months ago

> Error: ValueError: Unrecognized model in /home/seceon/dalm/model_output_test. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, camembert, canine, chinese_clip, clap, clip, clipseg, codegen, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, data2vec-audio......
>
> This is all that is in model_output_test:
>
> {"in_features": 1024, "out_features": 1024, "bias": false}

I got the same issue with a finetuned bge-m3 model.