run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Stop at "Generating embeddings: 0it [00:00, ?it/s]" when use SchemaLLMPathExtractor #15065

Open Benny1991-lucky opened 3 months ago

Benny1991-lucky commented 3 months ago

Bug Description

When I run a LlamaIndex property graph, embedding always stops at "Generating embeddings: 0it [00:00, ?it/s]" when using SchemaLLMPathExtractor. There is no problem with ImplicitPathExtractor or SimpleLLMPathExtractor.

Some code:

    kg_extractor = SchemaLLMPathExtractor(
        llm=Ollama(model="llama3", json_mode=True, request_timeout=3600),
        possible_entities=entities,
        possible_relations=relations,
        kg_validation_schema=validation_schema,
        # if false, allows for values outside of the schema
        # useful for using the schema as a suggestion
        strict=False,
        max_triplets_per_chunk=10,
        num_workers=4,
    )

    index = PropertyGraphIndex.from_documents(
        docs,
        kg_extractors=[kg_extractor],
        show_progress=True,
    )

Output:

    Extracting paths from text with schema: 100%|██████████| 22/22 [00:56<00:00, 2.55s/it]
    Generating embeddings: 100%|██████████| 3/3 [00:00<00:00, 3.04it/s]
    Generating embeddings: 0it [00:00, ?it/s]

Version

llama-index 0.10.58

Steps to Reproduce

Follow this guidance: https://docs.llamaindex.ai/en/stable/examples/property_graph/property_graph_advanced/

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 months ago

Hey @Benny1991-lucky! I'm here to help you with any bugs, questions, or contributions you have regarding the repository. Let's tackle this issue together!

The issue where the embedding process stops at "Generating embeddings: 0it [00:00, ?it/s]" when using SchemaLLMPathExtractor is likely due to the embedding model not being correctly configured or passed to the PropertyGraphIndex during index construction.

Here are a few steps to ensure the embedding model is correctly set:

  1. Ensure the Embedding Model is Correctly Configured: Make sure you have correctly initialized and passed the embedding model to the PropertyGraphIndex. Here is an example using a Hugging Face embedding model:

    from llama_index.embeddings.huggingface import HuggingFaceEmbeddings
    from llama_index import LangchainEmbedding, PropertyGraphIndex
    
    def get_embed_model():
       hf = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")
       return LangchainEmbedding(hf)
    
    embed_model = get_embed_model()
    
    kg_extractor = SchemaLLMPathExtractor(
       llm=Ollama(model="llama3", json_mode=True, request_timeout=3600),
       possible_entities=entities,
       possible_relations=relations,
       kg_validation_schema=validation_schema,
       strict=False,
       max_triplets_per_chunk=10,
       num_workers=4,
    )
    
    index = PropertyGraphIndex.from_documents(
       docs,
       kg_extractors=[kg_extractor],
       embed_model=embed_model,
       show_progress=True,
    )
  2. Enable Debug Logging: Turn on logging to get more insight into what might be going wrong. This can help you diagnose issues with the embedding process:

    import logging
    import sys
    
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)
    
    # ...then build the embed model, extractor, and index exactly as in step 1
  3. Check for Errors or Exceptions: Ensure that there are no errors or exceptions being raised during the embedding process. The SchemaLLMPathExtractor handles errors using try-except blocks, but enabling logging will help you see if any exceptions are being caught and handled silently.

By following these steps, you should be able to diagnose and resolve the issue with the embedding process stopping at "Generating embeddings: 0it [00:00, ?it/s]" [1][2].

To continue talking to Dosu, mention @dosu.

Benny1991-lucky commented 3 months ago

Hi,

After I used LangchainEmbedding to return the embedding model, I got this error: AttributeError: 'HuggingFaceEmbedding' object has no attribute 'aembed_documents'

And my HuggingFaceEmbedding class is imported via from llama_index.embeddings.huggingface import HuggingFaceEmbedding, not HuggingFaceEmbeddings. Here is my setup:

from llama_index import LangchainEmbedding, PropertyGraphIndex

def get_embed_model():
    hf = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    return LangchainEmbedding(hf)

dosubot[bot] commented 3 months ago

To resolve the AttributeError: 'HuggingFaceEmbedding' object has no attribute 'aembed_documents' error, you need to ensure that the HuggingFaceEmbedding class you are using has the aembed_documents method implemented. Here are the steps to address this:

  1. Check if aembed_documents is implemented: Ensure that the HuggingFaceEmbedding class has the aembed_documents method. If it does not, you need to implement it.

  2. Implement aembed_documents: If the method is not implemented, you can add it to the HuggingFaceEmbedding class. Here is an example implementation:

from typing import List

class HuggingFaceEmbedding:
    # Existing methods...

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        # Implement the asynchronous embedding logic here
        # For example, you can use an async version of your embedding model
        return [await self.embed_document_async(text) for text in texts]

    async def embed_document_async(self, text: str) -> List[float]:
        # Example async embedding logic
        # Replace this with the actual async call to your embedding model
        return self.embed_document(text)
  3. Fallback to synchronous method: If you cannot implement the asynchronous method, you can modify the LangchainEmbedding class to fall back to the synchronous method without raising an error (see the sketch below). This is already partially handled in the provided code with the _async_not_implemented_warn_once method.
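
A hypothetical sketch of that fallback idea (the wrapper class and method names here are illustrative, not the actual llama-index or langchain API):

    import asyncio
    from typing import List

    class SyncFallbackEmbedding:
        """Wrap a sync-only embedding object and expose an async method."""

        def __init__(self, inner):
            self.inner = inner

        async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
            # Run the synchronous call in a worker thread so the event
            # loop is not blocked while embeddings are computed.
            return await asyncio.to_thread(self.inner.embed_documents, texts)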

By ensuring that the aembed_documents method is implemented or properly handled, you can resolve the AttributeError [1][2].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 months ago

Don't use langchain embeddings; you can use our own huggingface class if you need to.
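
Something like this (a rough sketch, assuming llama-index 0.10.x import paths; the model name is just a placeholder):

    from llama_index.core import PropertyGraphIndex
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    # llama-index's native wrapper, no LangchainEmbedding involved
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    index = PropertyGraphIndex.from_documents(
        docs,
        kg_extractors=[kg_extractor],
        embed_model=embed_model,
        show_progress=True,
    )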

But also, it's not stopping, it's finishing fine. It's just indicating that zero kg nodes were extracted. If you are using an open-source LLM, this is usually expected, as they kind of stink at outputting reliable JSON.

My advice is to try the DynamicLLMPathExtractor instead if you are using open source: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/?h=dynamic#dynamicllmpathextractor
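
Roughly, following that guide (a sketch, not a drop-in: the Ollama import path assumes the llama-index-llms-ollama package, and the type lists are optional placeholder seeds, not a hard schema):

    from llama_index.core import PropertyGraphIndex
    from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
    from llama_index.llms.ollama import Ollama

    kg_extractor = DynamicLLMPathExtractor(
        llm=Ollama(model="llama3", request_timeout=3600),
        max_triplets_per_chunk=10,
        num_workers=4,
        # optional seeds; unlike SchemaLLMPathExtractor, the LLM may
        # add entity/relation types beyond these
        allowed_entity_types=["PERSON", "PLACE", "ORGANIZATION"],
        allowed_relation_types=["WORKS_AT", "LOCATED_IN", "PART_OF"],
    )

    index = PropertyGraphIndex.from_documents(
        docs,
        kg_extractors=[kg_extractor],
        show_progress=True,
    )

The practical difference: SchemaLLMPathExtractor validates extracted triplets against the schema (and drops anything outside it in strict mode), while DynamicLLMPathExtractor treats the types as suggestions, which is more forgiving of the loosely formatted output open-source models tend to produce.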

Benny1991-lucky commented 3 months ago

@logan-markewich, thanks for the support. Using DynamicLLMPathExtractor works for me.

BTW, what's the difference between DynamicLLMPathExtractor and SchemaLLMPathExtractor when using an open-source LLM?

Benny1991-lucky commented 3 months ago

So I cannot use the SchemaLLMPathExtractor with an open-source LLM like llama3:8B, right? I already tried llama3, gemma2, and llama3.1 loaded by Ollama, and the output was always "Generating embeddings: 0it [00:00, ?it/s]".

br00t4c commented 1 month ago

Don't know if you ever managed to resolve this issue, but I ran into the same! It appears that setting json_mode=True in the Ollama LLM results in JSON-formatted triplets from almost every recent Ollama model I tested (example: {"subject":"Sam Altman","predicate":"decided not to","object":"be president"}), but the parse_fn in the SchemaLLMPathExtractor expects raw-format triplets (example: ("Sam Altman","decided not to","be president")). When I set json_mode=False, everything started to work as expected!

kg_extractor = SchemaLLMPathExtractor(
    llm=Ollama(model="llama3", json_mode=False, request_timeout=3600),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=False,
    max_triplets_per_chunk=10,
    num_workers=4,
)