run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Way to remove some metadata values from vector db (PineconeVectorStore) #11750

Closed ayseozgun closed 4 months ago

ayseozgun commented 8 months ago

Question

I am using the sentence window node parser before creating vectors, then upserting the resulting nodes to Pinecone. SentenceWindowNodeParser adds default metadata keys to each Pinecone record (_node_content, _node_type, doc_id, document_id, original_text, ref_doc_id, window). (screenshot omitted)

Is there any way to remove ref_doc_id and doc_id?

    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

Thanks

dosubot[bot] commented 8 months ago

To modify the SentenceWindowNodeParser in LlamaIndex to exclude 'ref_doc_id' and 'doc_id' from the default metadata when creating vectors and upserting nodes to Pinecone, you can add these keys to the lists excluded_embed_metadata_keys and excluded_llm_metadata_keys for each node. This ensures that these metadata keys are not included in the embeddings and LLM operations.

Here's how you can modify the build_window_nodes_from_documents method to achieve this:

# Imports needed if you override this method in a SentenceWindowNodeParser
# subclass (module paths may vary by llama_index version):
# from typing import List, Sequence
# from llama_index.core.schema import BaseNode, Document
# from llama_index.core.node_parser.node_utils import build_nodes_from_splits
def build_window_nodes_from_documents(
    self, documents: Sequence[Document]
) -> List[BaseNode]:
    """Build window nodes from documents."""
    all_nodes: List[BaseNode] = []
    for doc in documents:
        text = doc.text
        text_splits = self.sentence_splitter(text)
        nodes = build_nodes_from_splits(
            text_splits,
            doc,
            id_func=self.id_func,
        )

        # add window to each node
        for i, node in enumerate(nodes):
            window_nodes = nodes[
                # +1 so the slice covers window_size sentences on each side
                max(0, i - self.window_size) : min(i + self.window_size + 1, len(nodes))
            ]

            node.metadata[self.window_metadata_key] = " ".join(
                [n.text for n in window_nodes]
            )
            node.metadata[self.original_text_metadata_key] = node.text
            # exclude window metadata from embed and llm
            node.excluded_embed_metadata_keys.extend(
                [self.window_metadata_key, self.original_text_metadata_key, 'ref_doc_id', 'doc_id']
            )
            node.excluded_llm_metadata_keys.extend(
                [self.window_metadata_key, self.original_text_metadata_key, 'ref_doc_id', 'doc_id']
            )

        all_nodes.extend(nodes)

    return all_nodes

In this modification, 'ref_doc_id' and 'doc_id' are added to the lists excluded_embed_metadata_keys and excluded_llm_metadata_keys within the loop that iterates over each node. This ensures that these keys are excluded from the metadata used in embedding and LLM operations for each node.
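If subclassing feels heavy, the same effect can be achieved by extending the exclusion lists on the nodes the default parser returns. A minimal sketch of that idea; FakeNode and exclude_keys are illustrative stand-ins, not LlamaIndex API (the real TextNode exposes the same two list attributes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FakeNode:
    # stand-in for llama_index's TextNode exclusion-list attributes
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

def exclude_keys(nodes, keys=("ref_doc_id", "doc_id")):
    """Extend each node's exclusion lists in place and return the nodes."""
    for node in nodes:
        node.excluded_embed_metadata_keys.extend(keys)
        node.excluded_llm_metadata_keys.extend(keys)
    return nodes

nodes = exclude_keys([FakeNode(), FakeNode()])
print(nodes[0].excluded_embed_metadata_keys)  # ['ref_doc_id', 'doc_id']
```

With the real parser, you would run this loop over the output of `node_parser.get_nodes_from_documents(documents)` before upserting to Pinecone.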

The purpose of the excluded_embed_metadata_keys and excluded_llm_metadata_keys lists in the LlamaIndex codebase is to filter out specific metadata keys when generating content representations in different contexts, specifically for embeddings and language model (LLM) processing. These lists are used in the get_metadata_str method of the TextNode class. When generating the string representation of a node's metadata, the method checks the current metadata mode:

- ALL — include all metadata
- EMBED — exclude keys listed in excluded_embed_metadata_keys
- LLM — exclude keys listed in excluded_llm_metadata_keys
- NONE — include no metadata at all

This functionality allows for flexibility in how metadata is included in the content representation, enabling the exclusion of certain metadata that might not be relevant or could negatively impact the processing or analysis being performed, whether it's for embedding generation or language model processing.
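The filtering behavior described above can be sketched in plain Python. This is only an illustration of the idea behind TextNode.get_metadata_str; filter_metadata and the lowercase mode strings are mine, not LlamaIndex API:

```python
def filter_metadata(metadata, mode, excluded_embed=(), excluded_llm=()):
    """Return the metadata visible in the given mode."""
    if mode == "none":
        return {}
    if mode == "embed":
        return {k: v for k, v in metadata.items() if k not in excluded_embed}
    if mode == "llm":
        return {k: v for k, v in metadata.items() if k not in excluded_llm}
    return dict(metadata)  # "all"

meta = {"window": "w", "original_text": "t", "doc_id": "1", "ref_doc_id": "1"}
excluded = ["window", "original_text", "ref_doc_id", "doc_id"]
print(filter_metadata(meta, "embed", excluded_embed=excluded))  # {}
print(filter_metadata(meta, "all"))  # all four keys survive
```

Note that this filtering only affects what the embedding model and LLM see; the keys still exist in the node's metadata dict, which is why they appear in Pinecone.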

You can find more details in the LlamaIndex source code.

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.