run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to increase the performance of extracting triples from words by using Knowledge Graph Index #9337

Closed nttng207 closed 7 months ago

nttng207 commented 11 months ago


Question

I tried increasing max_triplets_per_chunk, but it only made building the index take longer; query performance didn't improve much. Is there any solution to this problem?

dosubot[bot] commented 11 months ago

🤖

Hello @nttng207! 👋

I'm Dosu, a bot here to assist you while we're waiting for a human maintainer. If you have any questions, need help with bugs, or want to become a contributor, I'm here to help guide you. Nice to meet you! Let's dive into your issue.

To improve the performance of extracting triples from words using the Knowledge Graph Index, you can consider the following approaches:

  1. Optimize the triplet extraction function: The performance of the Knowledge Graph Index is heavily dependent on the efficiency of the triplet extraction function. If you are using a custom function (kg_triplet_extract_fn), ensure that it is optimized for performance. If you are using the default function (_llm_extract_triplets), consider implementing a custom function that is more efficient for your specific use case (see the sketch below).

  2. Parallelize the extraction process: The current implementation processes each node sequentially. You can improve performance by processing multiple nodes in parallel. This can be achieved by using Python's multiprocessing or concurrent.futures modules. However, you need to ensure that the shared resources (like the graph store) are thread-safe.

  3. Use a more efficient graph store: The performance of the Knowledge Graph Index can also be affected by the efficiency of the graph store. If you are using the SimpleGraphStore, consider using a more efficient graph store.

  4. Optimize the embedding generation: If you are including embeddings (include_embeddings=True), ensure that the embedding generation process is optimized. If the embedding generation is the bottleneck, consider using a more efficient model or reducing the dimensionality of the embeddings.

Please note that these are general suggestions and their effectiveness can vary depending on your specific use case and data. You can refer to the llama_index/indices/knowledge_graph/base.py for more details.
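
To illustrate suggestion 1, here is a minimal sketch of plugging a custom extraction function into the index. The naive rule-based extractor below is only a placeholder for whatever domain-specific logic (for example a spaCy or REBEL pipeline) you would actually use, and ./data is a placeholder path:

from typing import List, Tuple

from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader

def my_triplet_extract_fn(text: str) -> List[Tuple[str, str, str]]:
    """Placeholder extractor: replace with your own domain-specific logic."""
    triplets = []
    for sentence in text.split("."):
        words = sentence.split()
        if len(words) >= 3:
            # naive subject-predicate-object guess, purely for illustration
            triplets.append((words[0], words[1], " ".join(words[2:])))
    return triplets

documents = SimpleDirectoryReader("./data").load_data()
index = KnowledgeGraphIndex.from_documents(
    documents,
    kg_triplet_extract_fn=my_triplet_extract_fn,
)

Because a custom function like this bypasses the LLM call entirely, indexing becomes much faster, but extraction quality then depends entirely on your own logic.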

As for the default function for triplet extraction in the LlamaIndex codebase, the KnowledgeGraphIndex uses its _llm_extract_triplets method, which prompts the LLM from the service_context with the kg_triple_extract_template to pull triplets out of each text chunk; you can refer to llama_index/indices/knowledge_graph/base.py for the details. (The extract method of the BaseExtractor class in llama_index/extractors/interface.py is the interface for metadata extractors, which extract metadata for a sequence of nodes, rather than the entry point for triplet extraction.)

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nttng207 commented 10 months ago

@dosu-bot Which graph store is best? Can you recommend one for me?

dosubot[bot] commented 10 months ago

🤖

Based on your requirements, you can consider using NebulaGraphStore, KuzuGraphStore, Neo4jGraphStore, or FalkorDBGraphStore as alternatives to the SimpleGraphStore. These graph stores are more advanced and potentially more efficient, and are available in the LlamaIndex framework.

Here's an example of how you can use the Neo4jGraphStore:

from llama_index.graph_stores.neo4j import Neo4jGraphStore

# Initialize the Neo4jGraphStore (the connection details below are placeholders)
graph_store = Neo4jGraphStore(
    username="neo4j",
    password="<password>",
    url="bolt://localhost:7687",
    database="neo4j",
)

# Use the graph_store for your operations

You can replace Neo4jGraphStore with the appropriate graph store class based on your choice.
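
To actually build an index on top of it, you would typically wrap the graph store in a StorageContext. A minimal sketch, assuming the graph_store initialized above and a ./data directory of documents:

from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader
from llama_index.storage.storage_context import StorageContext

# Tell the index to persist its triplets in the external graph store
storage_context = StorageContext.from_defaults(graph_store=graph_store)

documents = SimpleDirectoryReader("./data").load_data()
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
)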

Unfortunately, I couldn't find specific features and advantages of NebulaGraphStore, KuzuGraphStore, Neo4jGraphStore, and FalkorDBGraphStore within the LlamaIndex repository. However, these graph stores are generally known for their performance and scalability, which might be beneficial for your application.

For more details, you can refer to the llama_index/graph_stores/__init__.py file in the LlamaIndex repository.

If you need more specific information about these graph stores, could you please provide more details about your application's requirements? This would help me recommend the most suitable graph store for your needs.


nttng207 commented 10 months ago

@dosu-bot Is it true that when I set an embedding model in the service_context and then pass that service_context to the KnowledgeGraphIndex, the extraction of triples is affected by the embedding model I initialized?

dosubot[bot] commented 10 months ago

🤖

Yes, the embedding model initialized in the service_context does affect the KnowledgeGraphIndex, although strictly speaking it affects the embeddings generated for the index (when include_embeddings=True) rather than the triplet extraction itself, which is performed by the LLM in the service_context.

This is because the embed_model from the service_context is used to generate embeddings for the nodes in the _get_node_with_embedding and _aget_node_with_embedding methods. Here's the relevant code:

def _get_node_with_embedding(
    self,
    nodes: Sequence[BaseNode],
    show_progress: bool = False,
    is_image: bool = False,
) -> List[BaseNode]:
    ...
    else:
        id_to_embed_map = embed_nodes(
            nodes,
            embed_model=self._service_context.embed_model,
            show_progress=show_progress,
        )
    ...
async def _aget_node_with_embedding(
    self,
    nodes: Sequence[BaseNode],
    show_progress: bool = False,
    is_image: bool = False,
) -> List[BaseNode]:
    ...
    else:
        id_to_embed_map = await async_embed_nodes(
            nodes,
            embed_model=self._service_context.embed_model,
            show_progress=show_progress,
        )
    ...

In both methods, if the node is not an image, the embed_model from the service_context is used to generate embeddings for the nodes. This means the choice of embedding model in the service_context affects the embeddings stored alongside the extracted triples in the KnowledgeGraphIndex, rather than the extraction of the triples themselves. You can find more details in the llama_index/indices/multi_modal/base.py file.

The embed_model parameter in the embed_nodes and async_embed_nodes functions is used to determine the type of embedding model to be used for the LlamaIndex application. It can be a string or an instance of BaseEmbedding or LCEmbeddings. The exact role of embed_model in these functions cannot be determined from the given context, but you can refer to the llama_index/embeddings/utils.py file for more information.

In the EmbeddingAdapterFinetuneEngine class of the LlamaIndex codebase, the embed_model parameter is an instance of the BaseEmbedding class. It represents the embedding model that will be fine-tuned. You can find more details in the llama_index/finetuning/embeddings/adapter.py file.
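
Putting this together, here is a minimal sketch of configuring both the LLM (which performs the triplet extraction) and the embedding model (which embeds the results) in a single service_context; the model names and the ./data path are just examples:

from llama_index import KnowledgeGraphIndex, ServiceContext, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo"),  # used to extract the triplets
    embed_model=OpenAIEmbedding(),      # used only when include_embeddings=True
)

documents = SimpleDirectoryReader("./data").load_data()
index = KnowledgeGraphIndex.from_documents(
    documents,
    service_context=service_context,
    include_embeddings=True,
)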

I hope this helps! If you have any more questions, feel free to ask.


nttng207 commented 10 months ago

@dosu-bot Can you suggest some embedding models that effectively support triplet extraction?

dosubot[bot] commented 10 months ago

🤖

In the LlamaIndex framework, the OpenAI embedding models defined in the OpenAIEmbeddingModelType and OpenAIEmbeddingModeModel enums are the ones commonly used with the KnowledgeGraphIndex. (Keep in mind that the embeddings are used to embed the extracted triplets and nodes for retrieval; the triplet extraction itself is performed by the LLM, as described below.) The models include DAVINCI, CURIE, BABBAGE, ADA, and TEXT_EMBED_ADA_002, and each supports both similarity and text search modes.

Here is the code snippet that defines these models:

class OpenAIEmbeddingModelType(str, Enum):
    """OpenAI embedding model type."""

    DAVINCI = "davinci"
    CURIE = "curie"
    BABBAGE = "babbage"
    ADA = "ada"
    TEXT_EMBED_ADA_002 = "text-embedding-ada-002"

class OpenAIEmbeddingModeModel(str, Enum):
    """OpenAI embedding mode model."""

    # davinci
    TEXT_SIMILARITY_DAVINCI = "text-similarity-davinci-001"
    TEXT_SEARCH_DAVINCI_QUERY = "text-search-davinci-query-001"
    TEXT_SEARCH_DAVINCI_DOC = "text-search-davinci-doc-001"

    # curie
    TEXT_SIMILARITY_CURIE = "text-similarity-curie-001"
    TEXT_SEARCH_CURIE_QUERY = "text-search-curie-query-001"
    TEXT_SEARCH_CURIE_DOC = "text-search-curie-doc-001"

    # babbage
    TEXT_SIMILARITY_BABBAGE = "text-similarity-babbage-001"
    TEXT_SEARCH_BABBAGE_QUERY = "text-search-babbage-query-001"
    TEXT_SEARCH_BABBAGE_DOC = "text-search-babbage-doc-001"

    # ada
    TEXT_SIMILARITY_ADA = "text-similarity-ada-001"
    TEXT_SEARCH_ADA_QUERY = "text-search-ada-query-001"
    TEXT_SEARCH_ADA_DOC = "text-search-ada-doc-001"

    # text-embedding-ada-002
    TEXT_EMBED_ADA_002 = "text-embedding-ada-002"

You can find this in the llama_index/embeddings/openai.py file.

The ServiceContext class initializes the embedding model in its from_defaults and from_service_context class methods. If the embed_model argument is not provided, it defaults to "default". The callback_manager is then set on the embed_model. You can find this in the llama_index/service_context.py file.

The KnowledgeGraphIndex utilizes the embedding models initialized in the service_context for triplet extraction in two main ways:

  1. Triplet Extraction: The KnowledgeGraphIndex uses the service_context to extract triplets from the text. This is done in the _llm_extract_triplets method where the llm model from the service_context is used to predict triplets from the text using the kg_triple_extract_template.

  2. Embedding Generation: If include_embeddings is set to True, the KnowledgeGraphIndex uses the embed_model from the service_context to generate embeddings for the extracted triplets. This is done in the _build_index_from_nodes and _insert methods.

You can find this in the llama_index/indices/knowledge_graph/base.py file.
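
If you want to pin a specific embedding model from the enums above, a rough sketch would look like the following (text-embedding-ada-002 is the usual default; exact parameter names may differ slightly across versions):

from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.embeddings.openai import OpenAIEmbeddingModelType

# Select the embedding model explicitly; this only affects embedding generation,
# not the LLM-based triplet extraction.
embed_model = OpenAIEmbedding(model=OpenAIEmbeddingModelType.TEXT_EMBED_ADA_002)
service_context = ServiceContext.from_defaults(embed_model=embed_model)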

I hope this helps! If you have any other questions, feel free to ask.


nttng207 commented 10 months ago

@dosu-bot Can you recommend some ways to optimize the Knowledge Graph Index? Is there any way to extract as much triplet information as possible?

dosubot[bot] commented 10 months ago

🤖

To optimize the Knowledge Graph Index in the LlamaIndex framework for maximum triplet extraction, you can utilize the SentenceEmbeddingOptimizer class. This class optimizes a text chunk given the query by shortening the input text. Here are some methods you can use:

  1. Percentile Cutoff: This is the percentile cutoff for the top k sentences to use. For example, if you set percentile_cutoff=0.5, this means that the top 50% of sentences will be used.

  2. Threshold Cutoff: This is the threshold cutoff for similarity for each sentence to use. For example, if you set threshold_cutoff=0.7, only sentences with a similarity score higher than the threshold will be used.

  3. Context Before and After: These are the number of sentences before and after the retrieved sentence for further context. You can set context_before and context_after to any integer value.

Here is an example of how to use the SentenceEmbeddingOptimizer:

from llama_index.postprocessor import SentenceEmbeddingOptimizer
# (in older releases this class lived at llama_index.optimization.optimizer)

optimizer = SentenceEmbeddingOptimizer(
    percentile_cutoff=0.5,
    threshold_cutoff=0.7,
    context_before=1,
    context_after=1,
)

# Apply the optimizer as a node postprocessor at query time
query_engine = index.as_query_engine(node_postprocessors=[optimizer])
response = query_engine.query("<query_str>")

As for the embedding models, the LlamaIndex framework uses the OpenAIEmbedding model by default. This model is a good choice for most use cases. However, if you want to use a different model, you can pass it as the embed_model parameter when initializing the SentenceEmbeddingOptimizer. The model should be an instance of BaseEmbedding or any of its subclasses. You can find more details in the optimizer.py file.

Regarding the max_triplets_per_chunk parameter, its default value is 10. This parameter determines the maximum number of triplets to extract from a chunk of data. It affects the Knowledge Graph Index performance by limiting the number of triplets that can be extracted from a chunk of data. This can be useful to prevent overloading the system with too many triplets, especially when dealing with large datasets. However, setting this value too low may result in missing out on important information. Therefore, it's important to set this parameter appropriately based on the specific requirements and resources of your application. You can find more details in the base.py file.
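
As a rough sketch of how these knobs are usually turned together to capture more triplets (the values and the ./data path are illustrative, not recommendations):

from llama_index import KnowledgeGraphIndex, ServiceContext, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
# smaller chunks can also surface more triplets, at the cost of more LLM calls
service_context = ServiceContext.from_defaults(chunk_size=512)

index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=20,  # default is 10; higher values allow more triplets per chunk
    include_embeddings=True,
    service_context=service_context,
)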

I hope this helps! If you have any other questions, feel free to ask.


cmxwu commented 6 months ago

Hi, I realised that https://github.com/run-llama/llama_index/blob/main/llama_index/extractors/interface.py cannot be opened (404 not found). Is it possible to point me to the source code so that I can better understand the functionality when using LlamaIndex? I also noticed a significant slowdown in KnowledgeGraphIndex for large documents. Is there a size limit for documents? Does KnowledgeGraphIndex.from_documents accept tabular structured data? Thank you so much, and I appreciate any reply and help!