run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: access metadata like author in prompt #16132

Open rankofootball opened 1 month ago

rankofootball commented 1 month ago

Question

I need to include an attribute of each chunk, e.g. the author of the document the chunk comes from, in the prompt, like: prompt = "... If the author's name is blabla, then mention ... author: {author} ..."

The author name can be extracted while preparing the document for chunking and added to the document as metadata. How can I access that metadata in the prompt?

dosubot[bot] commented 1 month ago

Hey @rankofootball, I'm here to help you with any bugs, questions, or contributions you have! Let's get that issue sorted for you.

To make metadata such as a document's author available to the prompt, you can attach it to each chunk during ingestion with a custom extractor. Here is an example of a custom extractor that carries the author metadata onto every node:

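from typing import Dict, List

from llama_index.core.extractors import BaseExtractor


class AuthorExtractor(BaseExtractor):
    """Copy the author into each node's metadata, with a fallback value."""

    async def aextract(self, nodes) -> List[Dict]:
        # Return one metadata dict per node, in the same order as `nodes`.
        return [
            {"author": node.metadata.get("author", "Unknown Author")}
            for node in nodes
        ]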

# Example usage
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.ingestion import IngestionPipeline

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

transformations = [
    text_splitter,
    AuthorExtractor(),
    # Add other extractors as needed
]

pipeline = IngestionPipeline(transformations=transformations)

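# `documents` is assumed to be a list of Document objects loaded earlier,
# each with an "author" entry in its metadata.
nodes = pipeline.run(documents=documents)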

This custom extractor sets an author entry on each node, falling back to "Unknown Author" when the source document has none, so the value is available when the prompt is built. You can also combine it with other extractors such as TitleExtractor and KeywordExtractor to enrich the metadata further [1].
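On the original question of how the author reaches the prompt: by default, LlamaIndex includes a node's metadata alongside its text when it builds the context for the LLM, so the prompt can refer to an "author:" line directly. The plain-Python sketch below illustrates that idea without the library. Chunk, render_chunk, build_prompt, and the template wording are hypothetical stand-ins for illustration, not LlamaIndex API.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def render_chunk(chunk: Chunk, keys=("author",)) -> str:
    """Prefix the chunk text with the selected metadata as 'key: value' lines."""
    header = "\n".join(
        f"{key}: {chunk.metadata.get(key, 'Unknown')}" for key in keys
    )
    return f"{header}\n\n{chunk.text}"


# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "Each context block starts with an 'author:' line.\n"
    "If the author's name is blabla, then mention it in the answer.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Question: {query_str}\n"
)


def build_prompt(chunks, query: str) -> str:
    """Join the rendered chunks into one context string and fill the template."""
    context_str = "\n\n".join(render_chunk(c) for c in chunks)
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query)


chunks = [
    Chunk("Bees communicate by dancing.", {"author": "blabla"}),
    Chunk("Ants follow pheromone trails."),  # no author metadata
]
prompt = build_prompt(chunks, "How do insects communicate?")
print(prompt)
```

Because the author travels inside the context block of each chunk, the instruction in the prompt can be phrased once, rather than needing a separate {author} placeholder for every retrieved chunk.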

For more advanced chunking, you can use tools like Docugami, which breaks documents down into a hierarchical semantic XML tree of chunks. The chunks follow the semantic contours of the document and are annotated with additional metadata, which gives a more meaningful representation than fixed-size splitting. This can be particularly useful for high-accuracy document QA without context window restrictions. For example, you can use the DocugamiReader to load documents and their metadata into LlamaIndex:

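# Assumes the llama-index-readers-docugami package is installed; the import
# path below is the one used by recent LlamaIndex releases.
from llama_index.readers.docugami import DocugamiReader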

docset_id = "ecxqpipcoe2p"
document_ids = ["43rj0ds7s0ur", "bpc1vibyeke2"]

loader = DocugamiReader()
documents = loader.load_data(docset_id=docset_id, document_ids=document_ids)

The metadata for each Document (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

  1. id and name: ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
  2. xpath: XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.
  3. structure: Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
  4. tag: Semantic tag for the chunk, using various generative and extractive techniques [2][3].
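As a rough illustration of how a caller might use these fields, the sketch below filters chunks by their structure attribute and builds a source citation from name and xpath. The chunk values are made up, plain dicts stand in for Document objects, and treating structure as a space-separated string is an assumption. drop_structures and cite are hypothetical helpers, not part of Docugami or LlamaIndex.

```python
# Hypothetical chunks shaped like the metadata described above; the values
# are invented for illustration and do not come from a real docset.
chunks = [
    {
        "text": "Master Services Agreement",
        "metadata": {"id": "doc-1", "name": "msa.pdf",
                     "xpath": "/docset/Title", "structure": "h1",
                     "tag": "Title"},
    },
    {
        "text": "This agreement runs for two years.",
        "metadata": {"id": "doc-1", "name": "msa.pdf",
                     "xpath": "/docset/Body/Term", "structure": "div",
                     "tag": "Term"},
    },
    {
        "text": "Fee | 1000",
        "metadata": {"id": "doc-1", "name": "msa.pdf",
                     "xpath": "/docset/Body/Fees/Row", "structure": "td",
                     "tag": "Fee"},
    },
]


def drop_structures(chunks, excluded=("table", "td")):
    """Keep chunks whose structure attribute contains no excluded token.

    Assumes `structure` is a space-separated string of element names.
    """
    kept = []
    for chunk in chunks:
        tokens = chunk["metadata"].get("structure", "").split()
        if not any(token in excluded for token in tokens):
            kept.append(chunk)
    return kept


def cite(chunk):
    """Build a simple source citation from the file name and XPath."""
    meta = chunk["metadata"]
    return f"{meta['name']} ({meta['xpath']})"


kept = drop_structures(chunks)
for chunk in kept:
    print(cite(chunk), "->", chunk["text"])
```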

With this approach, your document chunks carry semantic and structural metadata, which can support more accurate and meaningful data extraction and retrieval.
