nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.17k stars, 117 forks

Add more metadata info - page label and filename? #13

Closed: vishhvak closed this issue 8 months ago

vishhvak commented 8 months ago

I'm trying to use it as a PDF reader for LlamaIndex, which usually also includes details like the page label with each document. Is there any way to add that info too? How would I go about customizing it to do that myself?

ansukla commented 8 months ago

Yes, this is possible. We will expose the raw json at every node, so that you can get that info while indexing. Could you please share the code you are using to index?

vishhvak commented 8 months ago

I'm using the example from llama_hub

ansukla commented 8 months ago

Got it, will push a PR soon to add meta information to the loader.

vishhvak commented 8 months ago

If I weren't using the loader, how would I add metadata myself? (I need to get this working ASAP for work :) )

ansukla commented 8 months ago

For now, you can simply iterate through the chunks and use the chunk_id here. page_idx and other meta info will be added to the chunks in a subsequent release.

from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

# `doc` is the parsed document returned by the llmsherpa PDF reader.
index = VectorStoreIndex([])
for chunk_id, chunk in enumerate(doc.chunks()):
    # Store the chunk's position in the document as its ID.
    index.insert(Document(text=chunk.to_context_text(), extra_info={"id": chunk_id}))
query_engine = index.as_query_engine()

vishhvak commented 8 months ago

Is there a way to go from json to chunks? If there is, maybe I can manually add page label data that way?

ChintanDonda commented 8 months ago

I have the same requirement. Can we add the page number and PDF filename (or URL, in the case of an online PDF) as metadata with each chunk?

I need this to cite the SOURCE of the information retrieved in a RAG QA application.

ansukla commented 8 months ago

Metadata is now available via chunk.block_json, or directly via attributes such as chunk.page_idx and chunk.block_idx.
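
For anyone landing here later, below is a rough sketch of how those attributes might be combined with a filename or URL to build per-chunk metadata for source attribution. It is not the library's own loader: build_records, StubChunk, the source argument, and the page_label key are hypothetical names introduced for illustration, and page_label assumes page_idx is zero-based. StubChunk only stands in for a real chunk so the snippet runs without a parsed PDF; it mimics the page_idx, block_idx, and to_context_text() names mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


def build_records(chunks: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Pair each chunk's text with metadata useful for citing its source.

    `chunks` is any iterable of objects exposing `page_idx`, `block_idx`
    and `to_context_text()`. `source` is the PDF filename or URL
    (hypothetical parameter, supplied by the caller).
    """
    records = []
    for chunk_id, chunk in enumerate(chunks):
        records.append(
            {
                "text": chunk.to_context_text(),
                "extra_info": {
                    "id": chunk_id,
                    "source": source,
                    "page_idx": chunk.page_idx,
                    # Assumption: page_idx is zero-based, so add 1 for a
                    # human-readable page label.
                    "page_label": str(chunk.page_idx + 1),
                    "block_idx": chunk.block_idx,
                },
            }
        )
    return records


@dataclass
class StubChunk:
    """Stand-in for a real chunk, used only to make this sketch runnable."""

    text: str
    page_idx: int
    block_idx: int

    def to_context_text(self) -> str:
        return self.text


if __name__ == "__main__":
    demo = build_records([StubChunk("Revenue grew.", 4, 17)], "report.pdf")
    print(demo[0]["extra_info"])
```

With a real parsed document, the same function could presumably be called as build_records(doc.chunks(), pdf_url), and each record passed to Document(text=record["text"], extra_info=record["extra_info"]) in the indexing loop shown earlier in this thread.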