Closed by vishhvak 8 months ago
Yes, this is possible. We will expose the raw JSON at every node so that you can get that info while indexing. Could you please share the code you are using to index?
Got it, will push a PR soon to add meta information to the loader.
If I weren't using the loader, how would I add metadata myself? (I need to get this working ASAP for work :) )
For now, you can simply iterate through the chunks and use the chunk_id here. page_idx and other meta info will be added to the chunks in a subsequent release.
    from llama_index.readers.schema.base import Document
    from llama_index import VectorStoreIndex

    index = VectorStoreIndex([])
    for chunk_id, chunk in enumerate(doc.chunks()):
        index.insert(Document(text=chunk.to_context_text(), extra_info={"id": chunk_id}))
    query_engine = index.as_query_engine()
Is there a way to go from json to chunks? If there is, maybe I can manually add page label data that way?
I have the same requirement. Can we add the page number and PDF filename (or the URL, for an online PDF) as metadata with each chunk?
I require this to get the SOURCE of information retrieved in the RAG-QA application.
Metadata is now available via chunk.block_json, or through individual attributes such as chunk.page_idx and chunk.block_idx.
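For anyone who wants page info on each chunk before indexing, here is a minimal sketch of how those attributes could be collected into a metadata dict. StubChunk is a hypothetical stand-in so the snippet runs on its own; it only mirrors the page_idx, block_idx and to_context_text() names mentioned in this thread. The chunk_records helper, the "source" and "page_label" keys, and the assumption that page_idx is zero-based are mine, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class StubChunk:
    """Hypothetical stand-in for a parsed chunk (not the library's class)."""
    text: str
    page_idx: int
    block_idx: int

    def to_context_text(self) -> str:
        return self.text


def chunk_records(chunks: Iterable[Any], source: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each chunk's text with a metadata dict (hypothetical helper)."""
    records = []
    for chunk_id, chunk in enumerate(chunks):
        # getattr with a default keeps this working if an attribute is missing.
        page_idx = getattr(chunk, "page_idx", None)
        metadata = {
            "id": chunk_id,
            "source": source,  # PDF filename or URL supplied by the caller
            "page_idx": page_idx,
            "block_idx": getattr(chunk, "block_idx", None),
            # Assumption: page_idx is zero-based, so the human-readable label is +1.
            "page_label": str(page_idx + 1) if page_idx is not None else None,
        }
        records.append((chunk.to_context_text(), metadata))
    return records


chunks = [StubChunk("Intro text", 0, 0), StubChunk("Results table", 3, 12)]
records = chunk_records(chunks, source="report.pdf")
for text, metadata in records:
    print(text, metadata)
```

Each (text, metadata) pair can then be passed as Document(text=text, extra_info=metadata) in the insert loop from the earlier snippet, with doc.chunks() in place of the stub list.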
I'm trying to use it as a PDF reader for LlamaIndex, whose readers usually include details like a page label with each document. Is there any way to add that info too? How would I go about customizing it to do that myself?