run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: I want to highlight the source of extracted data in a PDF. What metadata is required? #14045

Open Hritesh-Hexaware opened 1 month ago

Hritesh-Hexaware commented 1 month ago

Question

I am parsing a PDF file and extracting information from it. I want a source_link that, when clicked, opens the PDF, jumps to the specific page number, and highlights the source node.

dosubot[bot] commented 1 month ago

To open a PDF, jump to a specific page, and highlight the source node, you need to include the page number and highlight text metadata in the source_link. Here is an example of how you can format the metadata:

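from urllib.parse import quote

for node in response.source_nodes:
    metadata = node.node.metadata
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {metadata}")
    print(f"Score:\t {node.score:.3f}")
    # Assumes 'file_path' is in the metadata (e.g. as set by SimpleDirectoryReader) and that
    # 'page_number' and 'highlight_text' were added to the metadata at ingestion time.
    # quote() URL-encodes the text; whether 'highlight' is honored depends on the PDF viewer.
    source_link = f"{metadata['file_path']}#page={metadata['page_number']}&highlight={quote(metadata['highlight_text'])}"
    print(f"Source Link:\t {source_link}")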

In this example, page_number is the page to jump to and highlight_text is the text to highlight. The metadata dictionary should contain both keys to generate the appropriate source_link [1].


logan-markewich commented 1 month ago

response.source_nodes contains the nodes used to make a response.

There, you can access the metadata (which might have the page number). To highlight in a PDF, I would use a fuzzy matching library, like fuse.js, to compare the node text against the original PDF text.
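
For anyone who wants to try the fuzzy-matching idea on the Python side, here is a rough sketch. It swaps fuse.js (a JavaScript library) for the standard library's difflib, so it is not the same algorithm, just the same idea: slide a word window over the page text and keep the window most similar to the node text. The function name best_fuzzy_match and the sample strings are made up for illustration; in practice page_text would come from whatever PDF text extractor you use and node_text from node.node.get_content().

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks in the PDF don't hurt the score."""
    return " ".join(text.lower().split())


def best_fuzzy_match(snippet: str, page_text: str) -> tuple[str, float]:
    """Return the window of page_text most similar to snippet and its score in [0, 1].

    Hypothetical helper: compares the snippet against every window of the
    same word count in the page text.
    """
    target = normalize(snippet)
    words = page_text.split()
    size = max(1, len(target.split()))
    best_window, best_score = "", 0.0
    for start in range(0, max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        score = SequenceMatcher(None, target, normalize(window)).ratio()
        if score > best_score:
            best_window, best_score = window, score
    return best_window, best_score


# Made-up sample data: the node text differs slightly from the PDF text.
page_text = "Invoice 2024-17\nTotal amount due: 1,250.00 USD\nPayment terms: 30 days"
node_text = "total amount due 1250.00 USD"

match, score = best_fuzzy_match(node_text, page_text)
print(f"Best match: {match!r} (score {score:.2f})")
```

The returned window is the exact text as it appears on the page, so it can be passed to the viewer's search or highlight feature. You would probably want a minimum score threshold before trusting a match, and the sliding window gets slow on long pages with long snippets.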