run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Hugging Face Inference API Fails to Extract Entities with SchemaLLMPathExtractor #14978

Open Get-Rect opened 1 month ago

Get-Rect commented 1 month ago

Bug Description

When using a HuggingFaceInferenceAPI LLM to build a property graph using the SchemaLLMPathExtractor, no entities or relations are extracted.

I have tested this with different models, extraction prompts, and settings and have not been able to get it to work, even though the same HuggingFaceInferenceAPI LLM works when used with a SimpleLLMPathExtractor.

Version

0.10.52

Steps to Reproduce

Clone and run the example repo: https://github.com/Get-Rect/property_graph_schema_hugging_face

Or, create a property graph using a HuggingFaceInferenceAPI LLM and a SchemaLLMPathExtractor. I have only tested this with a connection to a locally run Neo4j server, but the issue may persist with other graph stores.
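
For reference, a minimal sketch of the failing setup (the model name, token, and Neo4j credentials are placeholders; the HuggingFaceInferenceAPI import path assumes the llama-index-llms-huggingface package):

    from typing import Literal

    from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader
    from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
    from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
    from llama_index.llms.huggingface import HuggingFaceInferenceAPI

    llm = HuggingFaceInferenceAPI(
        model_name="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        token="hf_...",  # placeholder token
    )

    extractor = SchemaLLMPathExtractor(
        llm=llm,
        possible_entities=Literal["PERSON", "PLACE", "ORGANIZATION"],
        possible_relations=Literal["WORKS_AT", "LOCATED_IN"],
        strict=True,
    )

    graph_store = Neo4jPropertyGraphStore(
        username="neo4j",
        password="password",  # placeholder credentials
        url="bolt://localhost:7687",
    )

    documents = SimpleDirectoryReader("./data").load_data()
    index = PropertyGraphIndex.from_documents(
        documents,
        kg_extractors=[extractor],
        property_graph_store=graph_store,
        embed_kg_nodes=False,  # skip embeddings to isolate the extraction step
    )
    # symptom: the resulting graph contains no entities or relations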

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hey there, @Get-Rect! I'm here to help you with any bugs, questions, or contributions you have. Let's squash those bugs together!

The issue you're encountering, where the HuggingFaceInferenceAPI LLM fails to extract entities and relations with the SchemaLLMPathExtractor, is likely due to that extractor's strict validation. It requires extracted entities and relations to conform to a predefined schema, and output that fails validation is discarded. In contrast, the SimpleLLMPathExtractor parses free-form output without strict validation, which lets it work with a wider range of models [1][2].

To address this issue, you can try the following steps:

  1. Adjust the Schema Validation: Make the SchemaLLMPathExtractor less strict by setting the strict parameter to False, so triplets that fall outside the predefined schema are kept instead of discarded.

    from typing import Literal

    from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

    extractor = SchemaLLMPathExtractor(
        llm=your_llm_instance,
        extract_prompt=your_custom_prompt,
        # entity/relation labels are passed as Literal types
        possible_entities=Literal["Entity1", "Entity2"],
        possible_relations=Literal["Relation1", "Relation2"],
        strict=False,  # keep triplets even if they fall outside the schema
        max_triplets_per_chunk=5,
        num_workers=2,
    )
  2. Custom Prompts and Parsing Functions: Customize the extraction prompt to better suit your data and extraction needs. Note that SchemaLLMPathExtractor does not take a parse_fn argument; custom parsing functions belong to SimpleLLMPathExtractor, as shown below.

    from typing import Literal

    from llama_index.core.indices.property_graph import (
        SchemaLLMPathExtractor,
        SimpleLLMPathExtractor,
    )
    from llama_index.core.prompts import PromptTemplate

    # SchemaLLMPathExtractor accepts a custom extraction prompt
    custom_prompt = PromptTemplate("Your custom prompt here")

    extractor = SchemaLLMPathExtractor(
        llm=your_llm_instance,
        extract_prompt=custom_prompt,
        possible_entities=Literal["Entity1", "Entity2"],
        possible_relations=Literal["Relation1", "Relation2"],
        strict=False,
        max_triplets_per_chunk=5,
        num_workers=2,
    )

    # custom parsing functions are used by SimpleLLMPathExtractor instead
    def custom_parse_fn(response: str) -> list:
        # parse (subject, relation, object) triples out of the raw LLM text
        ...

    simple_extractor = SimpleLLMPathExtractor(
        llm=your_llm_instance,
        parse_fn=custom_parse_fn,
    )
  3. Logging and Debugging: Enable detailed logging and use the callback handler to trace and debug the extraction process. This can help identify specific issues and areas for improvement.

    import logging
    import sys

    # send debug-level logs to stdout (basicConfig already attaches a handler,
    # so no extra StreamHandler is needed)
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

    import llama_index.core

    # print each LLM input/output as it happens
    llama_index.core.set_global_handler("simple")

By making these adjustments, you should be able to improve the extraction performance and potentially resolve the issue with the HuggingFaceInferenceAPI LLM and SchemaLLMPathExtractor [3][4].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

@Get-Rect the model you are using is probably just not good enough to output consistent pydantic objects

I recommend not using this module for open-source LLMs, unless it's something like llama3.1 70b

Ollama recently added support for tool calling. Highly recommended
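
For example, a minimal sketch of that suggestion (the model tag and timeout are placeholders; assumes the llama-index-llms-ollama package and a local Ollama server):

    from typing import Literal

    from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
    from llama_index.llms.ollama import Ollama

    # llama3.1 supports tool calling in Ollama; the tag is a placeholder
    llm = Ollama(model="llama3.1:70b", request_timeout=300.0)

    extractor = SchemaLLMPathExtractor(
        llm=llm,
        possible_entities=Literal["Entity1", "Entity2"],
        possible_relations=Literal["Relation1", "Relation2"],
        strict=True,  # a tool-calling model should satisfy strict validation
    )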

Get-Rect commented 1 month ago

@logan-markewich Thank you for the advice. I have tried it with much bigger models like llama3.1-405b and had the same empty results. I have even set strict=False as Dosu recommended. What's weird is that I'm only running llama3 7b locally with Ollama and getting much better results than with any model through the inference API.

I will keep trying custom prompts and parsing functions and check the logs as Dosu recommended. Any other tips you can offer would be greatly appreciated.
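
One way to check what the extractor actually produces, independent of the graph store, is to run it directly on a single node and inspect the metadata it attaches (a minimal sketch, assuming the extractor from the setup above):

    from llama_index.core.schema import TextNode

    node = TextNode(text="Alice works at Acme Corp in Berlin.")  # toy input

    # extractors are TransformComponents and can be called on a list of nodes
    results = extractor([node])

    # extracted entities and relations are stashed in node metadata before
    # being written to the graph store; empty metadata means extraction failed
    print(results[0].metadata)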