ayseozgun closed this issue 4 months ago
To modify the SentenceWindowNodeParser in LlamaIndex so that 'ref_doc_id' and 'doc_id' are excluded from the metadata used when creating vectors and upserting nodes to Pinecone, you can add these keys to each node's excluded_embed_metadata_keys and excluded_llm_metadata_keys lists. This keeps those keys out of the metadata text passed to the embedding model and to the LLM.
Here's how you can modify the build_window_nodes_from_documents method to achieve this:
def build_window_nodes_from_documents(
    self, documents: Sequence[Document]
) -> List[BaseNode]:
    """Build window nodes from documents."""
    all_nodes: List[BaseNode] = []
    for doc in documents:
        text = doc.text
        text_splits = self.sentence_splitter(text)
        nodes = build_nodes_from_splits(
            text_splits,
            doc,
            id_func=self.id_func,
        )
        # add window to each node
        for i, node in enumerate(nodes):
            window_nodes = nodes[
                max(0, i - self.window_size) : min(i + self.window_size, len(nodes))
            ]
            node.metadata[self.window_metadata_key] = " ".join(
                [n.text for n in window_nodes]
            )
            node.metadata[self.original_text_metadata_key] = node.text
            # exclude window metadata and the document id keys from embed and llm
            node.excluded_embed_metadata_keys.extend(
                [self.window_metadata_key, self.original_text_metadata_key, 'ref_doc_id', 'doc_id']
            )
            node.excluded_llm_metadata_keys.extend(
                [self.window_metadata_key, self.original_text_metadata_key, 'ref_doc_id', 'doc_id']
            )
        all_nodes.extend(nodes)
    return all_nodes
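The windowing step in the method above can be tried in isolation. This is a minimal sketch that reuses the same slice bounds on plain strings; build_windows is a hypothetical helper written for illustration and is not part of LlamaIndex.

```python
from typing import List


def build_windows(sentences: List[str], window_size: int) -> List[str]:
    """Return one joined window string per sentence (hypothetical helper)."""
    windows = []
    for i in range(len(sentences)):
        # Same bounds as the method above; the slice's upper bound is exclusive.
        start = max(0, i - window_size)
        end = min(i + window_size, len(sentences))
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = ["s0.", "s1.", "s2.", "s3.", "s4."]
windows = build_windows(sentences, window_size=2)
for sentence, window in zip(sentences, windows):
    print(f"{sentence!r} -> {window!r}")
```

Because the upper bound of a Python slice is exclusive, these bounds give each window up to window_size sentences before the current one and up to window_size - 1 after it.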
In this modification, 'ref_doc_id' and 'doc_id' are appended to excluded_embed_metadata_keys and excluded_llm_metadata_keys inside the loop over nodes, so every node leaves them out of the metadata text used for embedding and LLM calls. Note that these lists only control which metadata keys appear in a node's text representation; on their own they do not delete keys from the metadata stored alongside the vector.
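If editing the library method is not desirable, a similar effect could be applied to the nodes after parsing. The sketch below assumes only that each node exposes the two list attributes named above. exclude_metadata_keys is a hypothetical helper, not a LlamaIndex function, and the SimpleNamespace objects stand in for real parsed nodes.

```python
from types import SimpleNamespace
from typing import Iterable, List


def exclude_metadata_keys(nodes: Iterable, keys: List[str]) -> None:
    """Add keys to both exclusion lists of every node, skipping duplicates."""
    for node in nodes:
        for excluded in (
            node.excluded_embed_metadata_keys,
            node.excluded_llm_metadata_keys,
        ):
            for key in keys:
                if key not in excluded:
                    excluded.append(key)


# Stand-in objects for illustration; real nodes would come from the parser.
nodes = [
    SimpleNamespace(
        excluded_embed_metadata_keys=["window", "original_text"],
        excluded_llm_metadata_keys=["window", "original_text"],
    )
]
exclude_metadata_keys(nodes, ["ref_doc_id", "doc_id"])
print(nodes[0].excluded_embed_metadata_keys)
```

Checking for membership before appending keeps the lists free of duplicates if the helper is run more than once on the same nodes.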
The excluded_embed_metadata_keys and excluded_llm_metadata_keys lists filter out specific metadata keys when LlamaIndex builds a node's content for embeddings or for language model (LLM) processing. They are used in the get_metadata_str method of the TextNode class. When generating the string representation of a node's metadata, the method checks the metadata mode (ALL, EMBED, LLM, or NONE): ALL includes all metadata, EMBED excludes the keys listed in excluded_embed_metadata_keys, LLM excludes the keys listed in excluded_llm_metadata_keys, and NONE includes no metadata at all.
This lets you leave out metadata that is irrelevant to, or could degrade, embedding generation or LLM processing, while the metadata itself stays on the node.
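The mode-based filtering described above can be illustrated with a simplified stand-in. SketchNode and this MetadataMode enum are written for illustration only and are not the library's classes; the "key: value" line format is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class MetadataMode(Enum):
    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"


@dataclass
class SketchNode:
    """Simplified stand-in for a text node; not the library class."""

    metadata: Dict[str, Any] = field(default_factory=dict)
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        if mode is MetadataMode.NONE:
            return ""
        # ALL excludes nothing; EMBED and LLM each consult their own list.
        excluded: List[str] = []
        if mode is MetadataMode.EMBED:
            excluded = self.excluded_embed_metadata_keys
        elif mode is MetadataMode.LLM:
            excluded = self.excluded_llm_metadata_keys
        return "\n".join(
            f"{key}: {value}"
            for key, value in self.metadata.items()
            if key not in excluded
        )


node = SketchNode(
    metadata={"window": "A. B. C.", "original_text": "B.", "title": "Demo"},
    excluded_embed_metadata_keys=["window", "original_text"],
    excluded_llm_metadata_keys=["window"],
)
print(node.get_metadata_str(MetadataMode.EMBED))
```

In this sketch the exclusion lists change only the generated string; node.metadata still holds all three keys afterwards.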
You can find more details in the source code of LlamaIndex.
Question
I am using the sentence window node parser before creating vectors, and then upserting the resulting nodes to Pinecone. SentenceWindowNodeParser creates default metadata fields in Pinecone (_node_content, _node_type, doc_id, document_id, original_text, ref_doc_id, window).
Is there any way to remove ref_doc_id and doc_id?
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
Thanks