Open rankofootball opened 1 month ago
Hey @rankofootball, I'm here to help you with any bugs, questions, or contributions you have! Let's get that issue sorted for you.
To access metadata, such as the author of a document, in a prompt when preparing the document for chunking, you can define a custom extractor. Here is an example of how you can create a custom extractor to include the author metadata:
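from typing import Dict, List

from llama_index.core.extractors import BaseExtractor


class AuthorExtractor(BaseExtractor):
    """Copy the "author" metadata field onto each node."""

    async def aextract(self, nodes) -> List[Dict]:
        # Return one metadata dict per node, in the same order as `nodes`.
        metadata_list = [
            {"author": node.metadata.get("author", "Unknown Author")}
            for node in nodes
        ]
        return metadata_list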
# Example usage
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
transformations = [
    text_splitter,
    AuthorExtractor(),
    # Add other extractors as needed
]
pipeline = IngestionPipeline(transformations=transformations)
# `documents` is the list of Document objects loaded earlier, each carrying an "author" metadata entry
nodes = pipeline.run(documents=documents)
This custom extractor will add the author metadata to each node, which can then be used in your prompt. Additionally, you can combine this with other extractors like TitleExtractor, KeywordExtractor, etc., to enrich the metadata further [1].
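Once the author is stored in each node's metadata, using it in a prompt comes down to reading the value and substituting it into the template. The following is a minimal, library-free sketch of that step, not LlamaIndex API: `Chunk`, `PROMPT_TEMPLATE`, and `build_prompt` are hypothetical names used only for illustration, and `Chunk` stands in for a node that exposes `text` and a `metadata` dict.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a node: chunk text plus a metadata dict."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical template; {author} and {context} are filled in per chunk.
PROMPT_TEMPLATE = (
    "If the author's name is {author}, mention it in the answer.\n"
    "author: {author}\n"
    "context: {context}\n"
)


def build_prompt(chunk: Chunk, template: str = PROMPT_TEMPLATE) -> str:
    """Fill the template from one chunk, with a fallback when no author is set."""
    author = chunk.metadata.get("author", "Unknown Author")
    return template.format(author=author, context=chunk.text)


chunks: List[Chunk] = [
    Chunk("First chunk of text.", {"author": "Jane Doe"}),
    Chunk("Chunk without author metadata."),
]
prompts = [build_prompt(c) for c in chunks]
print(prompts[0])
```

With real nodes, the same lookup would be `node.metadata.get("author", "Unknown Author")`, matching the extractor above.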
For more advanced chunking techniques, you can use tools like Docugami, which breaks down documents into a hierarchical semantic XML tree of chunks. This approach provides a more meaningful representation by following the semantic contours of the document and annotating chunks with additional metadata. This can be particularly useful for high-accuracy Document QA without context window restrictions. For example, you can use the DocugamiReader to load documents and their metadata into LlamaIndex:
# Assumes the llama-index-readers-docugami package is installed
from llama_index.readers.docugami import DocugamiReader

docset_id = "ecxqpipcoe2p"
document_ids = ["43rj0ds7s0ur", "bpc1vibyeke2"]

loader = DocugamiReader()
documents = loader.load_data(docset_id=docset_id, document_ids=document_ids)
The metadata for each Document (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information.
This method ensures that your document chunks are enriched with semantic and structural metadata, facilitating more accurate and meaningful data extraction and retrieval.
Question
I need to include an attribute from each chunk, e.g. the author of the document the chunk comes from, in the prompt, like: prompt = "... If the author's name is blabla, then mention ... author: {author} ..."
The author name can be extracted while preparing the document for chunking and added to the document as metadata. How can I access that metadata in the prompt?