run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: sentencetransformersfinetuneengine not working with nomic embeddings #13289

Open Miheer99 opened 4 months ago

Miheer99 commented 4 months ago

Bug Description

Hi

I'm trying to fine-tune the nomic-ai/nomic-embed-text-v1 embedding model using SentenceTransformersFinetuneEngine and am running into an issue:

[screenshot: error traceback, reproduced as text in a comment below]

I have tried the suggested steps but had no luck solving the issue: [screenshot: attempted fixes]

Version

Version: 0.10.34

Steps to Reproduce

Mentioned in the screenshot above.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 4 months ago

To address the ImportError related to TensorFlow and its estimator package, follow these steps:

  1. Update TensorFlow and TensorFlow-Estimator to ensure compatibility. Use the following commands:

    pip install --upgrade tensorflow tensorflow-estimator
  2. Update Sentence-Transformers to a version compatible with your TensorFlow version:

    pip install --upgrade sentence-transformers
  3. Verify Dependencies: Use pip list to check the installed versions for compatibility issues.

  4. Environment Isolation: If problems persist, create a new virtual environment and reinstall your dependencies:

    python -m venv myenv
    source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
    pip install tensorflow tensorflow-estimator sentence-transformers
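
The `pip list` check in step 3 can also be scripted; here is a minimal stdlib sketch (the package names below are just examples, and `report_versions` is a hypothetical helper, not part of llama_index):

```python
# Report the installed versions of the packages involved, so version
# mismatches are easy to spot. importlib.metadata is stdlib (Python 3.8+).
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions

print(report_versions(["sentence-transformers", "tensorflow", "llama-index"]))
```

Any entry that comes back `None` or with an unexpected version points to the package that needs reinstalling in the active environment.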

Ensure there are no direct imports from tensorflow_estimator.python.estimator.api._v1.estimator in your code, as the issue seems to be with the environment setup rather than the code itself.

If the issue remains unresolved, please share more details such as the exact Python code snippet causing the ImportError, the specific versions of Python, TensorFlow, TensorFlow-Estimator, and Sentence-Transformers you're using, and any specific setup required for SentenceTransformersFinetuneEngine that could be causing the ImportError. This information will help in providing a more accurate solution.

#### Details

**Sources**
- [llama-index-legacy/llama_index/legacy/finetuning/embeddings/sentence_transformer.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/finetuning/embeddings/sentence_transformer.py)
- [llama-index-finetuning/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-finetuning/pyproject.toml)
- [llama-index-integrations/llms/llama-index-llms-ollama/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-ollama/pyproject.toml)
- [llama-index-integrations/tools/llama-index-tools-google/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/tools/llama-index-tools-google/pyproject.toml)
- [llama-index-finetuning/llama_index/finetuning/embeddings/sentence_transformer.py](https://github.com/run-llama/llama_index/blob/main/llama-index-finetuning/llama_index/finetuning/embeddings/sentence_transformer.py)

**About Dosu**
This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

Miheer99 commented 4 months ago

@dosu

I'm not facing any TensorFlow issue, and I've tried the other steps.

this is the error:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <cell line: 3>()
      1 from llama_index.finetuning import SentenceTransformersFinetuneEngine
      2
----> 3 finetune_engine = SentenceTransformersFinetuneEngine(
      4     train_dataset,  # Dataset to be trained on
      5     model_id="nomic-ai/nomic-embed-text-v1",  # HuggingFace reference to base embeddings model

5 frames
/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py in resolve_trust_remote_code(trust_remote_code, model_name, has_local_code, has_remote_code)
    625
    626     if has_remote_code and not has_local_code and not trust_remote_code:
--> 627         raise ValueError(
    628             f"Loading {model_name} requires you to execute the configuration file in that"
    629             " repo on your local machine. Make sure you have read the code there to avoid malicious use, then"

ValueError: Loading nomic-ai/nomic-embed-text-v1 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.
```
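
For context, the guard raising this error can be sketched roughly as follows (a simplification based on the traceback above, not the actual transformers source):

```python
# Simplified sketch of the check in transformers' dynamic_module_utils:
# a model whose code lives only in its Hub repo is refused unless the
# caller explicitly opts in with trust_remote_code=True.
def resolve_trust_remote_code(trust_remote_code, model_name,
                              has_local_code, has_remote_code):
    if has_remote_code and not has_local_code and not trust_remote_code:
        raise ValueError(
            f"Loading {model_name} requires you to execute the configuration "
            "file in that repo on your local machine. Make sure you have read "
            "the code there to avoid malicious use, then set the option "
            "trust_remote_code=True to remove this error."
        )
    return trust_remote_code
```

So the error is not about the environment at all: nomic-ai/nomic-embed-text-v1 ships custom modeling code, and loading it requires opting in.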

dosubot[bot] commented 4 months ago

To resolve the issue you're facing with the SentenceTransformersFinetuneEngine when trying to fine-tune the nomic-ai/nomic-embed-text-v1 model, you'll need to enable the execution of remote code by setting the trust_remote_code parameter to True. This step is necessary because the model you're attempting to use includes custom components or requires executing code during loading, which is disabled by default for security reasons.

Here's an updated snippet of how you should initialize the SentenceTransformersFinetuneEngine with the trust_remote_code parameter:

from llama_index.finetuning import SentenceTransformersFinetuneEngine

# Ensure to set trust_remote_code=True when initializing the finetune engine
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,  # Dataset to be trained on
    model_id="nomic-ai/nomic-embed-text-v1",  # HuggingFace reference to base embeddings model
    trust_remote_code=True  # Allow execution of remote code
)

Please note, it's crucial to be cautious when enabling this setting, especially when loading models from sources that you haven't verified, as it could potentially introduce security risks. Always ensure that you trust the source of the models you are working with and have reviewed any remote code that will be executed as part of the model loading process.

If you encounter any further issues or have additional questions, feel free to ask.


Miheer99 commented 4 months ago

@dosu I did this, but it didn't work.

dosubot[bot] commented 4 months ago

If setting trust_remote_code=True didn't resolve the issue, and considering the steps you've already taken, it seems we're dealing with a more complex problem. Given the information and the actions we've discussed, let's consider a few additional troubleshooting steps:

  1. Ensure Correct Model ID: Verify that the model ID nomic-ai/nomic-embed-text-v1 is correctly spelled and that the model exists in the repository you're trying to access. A typo or incorrect model ID could lead to unexpected errors.

  2. Compatibility and Environment Setup: It's crucial to ensure that all your packages and dependencies are up to date and compatible with each other. Since you've already tried updating TensorFlow and Sentence Transformers without success, double-check that your Python environment doesn't have conflicting versions of these libraries installed. Sometimes, global and virtual environment packages can conflict.

  3. Manual Configuration File Execution: Attempt to manually execute or inspect the configuration file associated with the nomic-ai/nomic-embed-text-v1 model. This step can help identify if the issue is with the remote code execution or with the model's configuration itself.

  4. Consult Official Documentation and Community Forums: Sometimes, specific models have unique requirements or known issues that are documented in their official repositories or discussed in community forums. Checking these resources might provide insights or solutions specific to the nomic-ai/nomic-embed-text-v1 model.

  5. Version Requirements: The error and the steps you've taken suggest there might be a version mismatch or a requirement not met by your current setup. Ensure that your Sentence Transformers library is updated to a version compatible with the model you're trying to fine-tune. The version you mentioned, 0.10.34, seems to be unrelated to Sentence Transformers, as their versioning is different. Make sure to install the correct version of Sentence Transformers that supports fine-tuning with the SentenceTransformersFinetuneEngine.

If these steps do not resolve the issue, it would be helpful to have more detailed information about the error message you're receiving after setting trust_remote_code=True. Additionally, knowing the exact environment setup, including the Python version and the versions of all relevant libraries, could provide further clues to the solution.


Thoams0211 commented 2 months ago

I also faced the same issue. Have you resolved it?

I have solved this issue.

You should add the trust_remote_code argument at line 109 of python3.9/site-packages/llama_index/core/embeddings/utils.py:

embed_model = HuggingFaceEmbedding(
    model_name=model_name, trust_remote_code=True, cache_folder=cache_folder
)

DO NOT change the order of these arguments, or the error is still raised.

What's more, I also needed to run pip install einops.

After that, my local nomic embedding model works.