[Question]: Custom Metadata extraction || want to add document source hyperlinks in metadata

To include each document's source link in the metadata produced by your ingestion pipeline, you can modify the CustomExtractor class so that it adds the link to every node's metadata:

1. Define the source link for each document.
2. Modify the CustomExtractor to include the source link in the metadata.

Here is the updated code:

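from llama_index.core import Document, Settings
from llama_index.core.extractors import BaseExtractor, SummaryExtractor, TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

class CustomExtractor(BaseExtractor):
    # BaseExtractor is a pydantic model, so the link is declared as a field
    # rather than assigned in a hand-written __init__.
    source_link: str

    async def aextract(self, nodes):
        # aextract is the method BaseExtractor requires; extract() wraps it.
        metadata_list = [
            {
                "custom": (
                    # .get avoids a KeyError when an earlier extractor did not
                    # set the key (excerpt_keywords comes from KeywordExtractor).
                    node.metadata.get("document_title", "")
                    + "\n"
                    + node.metadata.get("excerpt_keywords", "")
                ),
                "source_link": self.source_link,
            }
            for node in nodes
        ]
        return metadata_list

# Example usage
source_link = "https://example.com/document.pdf"
custom_extractor = CustomExtractor(source_link=source_link)
llm = Settings.llm  # assumes an LLM is already configured in Settings

# Creating a document with custom metadata
document = Document(
    text="This is the document content",
    metadata={"source_link": source_link}
)

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(paragraph_separator='\n\n\n', chunk_size=512, chunk_overlap=50, include_metadata=True),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["self"], llm=llm),
        custom_extractor,
        Settings.embed_model,
    ]
)

# Run the pipeline so the transformations are applied to the document
nodes = pipeline.run(documents=[document])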

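In this example, the CustomExtractor class accepts a source_link parameter and writes it into the metadata of every node it processes. You can then use this extractor in your ingestion pipeline to add the source link to the metadata of each document ^[1]^[2]^[3]^[4]. Note that one extractor instance holds a single link, so every node in a pipeline run receives that same link. When documents come from different sources, setting source_link in each Document's own metadata lets the nodes inherit it from their parent document.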
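
When every document has its own source, the link has to be computed per file rather than fixed once. The following is a minimal sketch of one way to do that, not part of the original answer: `BASE_URL` and the function name `source_link_metadata` are hypothetical, and it assumes the hosted URL can be derived from the file name alone.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Hypothetical base URL where the original files are hosted (an assumption).
BASE_URL = "https://example.com/docs/"


def source_link_metadata(file_path: str) -> dict:
    """Build per-document metadata with a link derived from the file name."""
    # Assumes POSIX-style paths with forward slashes.
    file_name = PurePosixPath(file_path).name
    return {
        "file_name": file_name,
        # quote() percent-encodes spaces and other characters unsafe in URLs.
        "source_link": BASE_URL + quote(file_name),
    }


if __name__ == "__main__":
    for path in ["data/annual report.pdf", "data/notes.txt"]:
        print(source_link_metadata(path))
```

Each returned dict could be passed as `metadata=` when constructing a `Document`. `SimpleDirectoryReader` also accepts a callable of this shape through its `file_metadata` argument, which would attach the link at load time without a custom extractor.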

run-llama / llama_index

[Question]: Custom Metadata extraction || want to add document source hyperlinks in metadata #15039
