Open avisheka93 opened 2 months ago
To attach the document source link as custom metadata in your ingestion pipeline, you can modify the CustomExtractor class so that it adds the source link to each node's metadata. Here is the updated code:
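from typing import Dict, List, Sequence

from llama_index.core import Document, Settings
from llama_index.core.extractors import BaseExtractor, SummaryExtractor, TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import BaseNode


class CustomExtractor(BaseExtractor):
    # BaseExtractor is a pydantic model, so declare the link as a field
    # rather than assigning it in __init__.
    source_link: str

    async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        # Return one metadata dict per node; each dict is merged into the
        # corresponding node's metadata.
        return [
            {
                # .get() avoids a KeyError when an upstream extractor did not
                # set the key (e.g. "excerpt_keywords" comes from KeywordExtractor).
                "custom": (
                    node.metadata.get("document_title", "")
                    + "\n"
                    + node.metadata.get("excerpt_keywords", "")
                ),
                "source_link": self.source_link,
            }
            for node in nodes
        ]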
# Example usage (assumes `llm` is already defined, as in the question)
source_link = "https://example.com/document.pdf"
custom_extractor = CustomExtractor(source_link=source_link)

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(
            paragraph_separator="\n\n\n",
            chunk_size=512,
            chunk_overlap=50,
            include_metadata=True,
        ),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["self"], llm=llm),
        custom_extractor,
        Settings.embed_model,
    ]
)

# Creating a document with custom metadata
document = Document(
    text="This is the document content",
    metadata={"source_link": source_link},
)

# Run the pipeline on the document to produce nodes
nodes = pipeline.run(documents=[document])
In this example, the CustomExtractor class accepts a source_link parameter and includes it in the metadata for each node. You can then use this extractor in your ingestion pipeline to add the source link to the metadata of each document [1][2][3][4]. Note that the link is fixed when the extractor is constructed, so every node processed by that extractor instance receives the same link.
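When links differ per document, one option is to read the link from each node's own metadata, since metadata set on a Document is generally copied to the nodes split from it. The following is a minimal sketch of that lookup, using plain dicts as stand-ins for node.metadata so it runs without LlamaIndex. The names resolve_source_links and fallback_link are hypothetical and not part of the library.

```python
from typing import Dict, List, Optional


def resolve_source_links(
    node_metadata: List[Dict[str, str]],
    fallback_link: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return one metadata dict per node, the shape an extractor returns.

    Each input dict stands in for node.metadata. A link already present on
    the node (inherited from its Document) wins; otherwise fallback_link
    is used.
    """
    results: List[Dict[str, str]] = []
    for metadata in node_metadata:
        link = metadata.get("source_link") or fallback_link
        # Emit an empty dict when no link is known, so the output stays
        # aligned one-to-one with the input nodes.
        results.append({"source_link": link} if link else {})
    return results


# Two nodes from different documents, plus one node with no link recorded.
nodes = [
    {"document_title": "Report A", "source_link": "https://example.com/a.pdf"},
    {"document_title": "Report B", "source_link": "https://example.com/b.pdf"},
    {"document_title": "Untitled"},
]
resolved = resolve_source_links(nodes, fallback_link="https://example.com/index")
for entry in resolved:
    print(entry)
```

Inside a BaseExtractor subclass, the same loop could run over node.metadata for each node in nodes, with fallback_link taking the place of the fixed source_link field.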
Question
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(paragraph_separator='\n\n\n', chunk_size=512, chunk_overlap=50, include_metadata=True),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["self"], llm=llm),
        Settings.embed_model,
    ]
)
I want to add custom metadata containing the source link for each document. How can I incorporate that?