run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Custom Metadata extraction || want to add document source hyperlinks in metadata #15039

Open avisheka93 opened 2 months ago

avisheka93 commented 2 months ago

Question

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(paragraph_separator='\n\n\n', chunk_size=512, chunk_overlap=50, include_metadata=True),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["self"], llm=llm),
        Settings.embed_model,
    ]
)

I want to add custom metadata containing the source link of each document. How can I incorporate that into this pipeline?

dosubot[bot] commented 2 months ago

To include the source link of each document in the metadata produced by your ingestion pipeline, you can define a CustomExtractor class that adds the link to the metadata of every node. Here is how you can do it:

  1. Define the source link for each document.
  2. Have the CustomExtractor include the source link in the metadata it returns.

Here is the updated code:

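from typing import Dict, List, Sequence

from llama_index.core import Document, Settings
from llama_index.core.extractors import BaseExtractor, SummaryExtractor, TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import BaseNode

class CustomExtractor(BaseExtractor):
    # BaseExtractor is a pydantic model, so the link is declared as a field
    # rather than assigned in a hand-written __init__.
    source_link: str

    # BaseExtractor expects the async aextract() to be implemented;
    # it must return one metadata dict per node.
    async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        metadata_list = [
            {
                # .get() avoids a KeyError when an upstream extractor did not
                # produce the key (no keyword extractor runs in this pipeline).
                "custom": (
                    node.metadata.get("document_title", "")
                    + "\n"
                    + node.metadata.get("excerpt_keywords", "")
                ),
                "source_link": self.source_link,
            }
            for node in nodes
        ]
        return metadata_list

# Example usage
source_link = "https://example.com/document.pdf"
custom_extractor = CustomExtractor(source_link=source_link)

# llm is assumed to be defined already, as in the question
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(paragraph_separator='\n\n\n', chunk_size=512, chunk_overlap=50, include_metadata=True),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["self"], llm=llm),
        custom_extractor,
        Settings.embed_model,
    ]
)

# Creating a document with custom metadata
document = Document(
    text="This is the document content",
    metadata={"source_link": source_link}
)

# Run the pipeline on the document to get nodes carrying the metadata
nodes = pipeline.run(documents=[document])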

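In this example, CustomExtractor accepts a source_link parameter and adds it to the metadata of every node it processes. Because one extractor instance holds a single link, this suits a run over documents that share one source. Metadata set directly on a Document, as in the last step, is copied to every node split from that document, so it is the simpler way to give each document its own link [1][2][3][4].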
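Since the question asks for a different link per document, the following is a minimal sketch of a helper that derives a source link from each file path. The base URL and the function name source_link_metadata are hypothetical and not part of LlamaIndex; the sketch assumes every file is hosted under one URL prefix.

```python
from pathlib import Path
from typing import Dict
from urllib.parse import quote

# Hypothetical base URL under which the source files are assumed to be hosted.
BASE_URL = "https://example.com/docs/"


def source_link_metadata(file_path: str) -> Dict[str, str]:
    """Build per-file metadata holding the file name and its source link."""
    file_name = Path(file_path).name
    return {
        "file_name": file_name,
        # quote() percent-encodes spaces and other unsafe characters.
        "source_link": BASE_URL + quote(file_name),
    }


if __name__ == "__main__":
    for path in ["data/report 2024.pdf", "data/notes.txt"]:
        print(source_link_metadata(path))
```

A function of this shape (file path in, dict out) can be passed as the file_metadata argument of SimpleDirectoryReader, or its result can be used as the metadata of a Document built by hand. Either way the link reaches every node split from that document without a custom extractor. Note that supplying file_metadata is expected to replace the reader's default file metadata, so any default keys you still need would have to be added to the returned dict.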
