run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
1.79k stars 157 forks

Feature Request: `result_type='markdown'` return page number #246

Open galvangoh opened 1 week ago

galvangoh commented 1 week ago

Hello LlamaParse team,

Would it be possible for LlamaParse to also return the page number by default?

From my testing, the parser treats each page of a document separately. The results already include a page separator, but if the Markdown results are chunked, indexed, and queried by an LLM, the LLM cannot tell which page of the original document a given chunk came from.

adreichert commented 1 week ago

I'm assuming that you want to have both the Markdown text of each page and the corresponding page number in memory at the same time, and that you don't want to guess the page number after splitting on \n---\n.
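For comparison, the guessing approach might look roughly like the sketch below. It assumes the separator is exactly \n---\n and that each chunk's position equals its page number. The helper name `split_pages_by_separator` and the sample string are made up for illustration and are not part of LlamaParse.

```python
def split_pages_by_separator(markdown: str, separator: str = "\n---\n") -> list[tuple[int, str]]:
    """Pair each separator-delimited chunk with its 1-based position.

    Assumption: the position of a chunk equals its page number. This is only
    a guess, because the separator can also occur inside a page's own text.
    """
    chunks = markdown.split(separator)
    return [(index, chunk.strip()) for index, chunk in enumerate(chunks, start=1)]


# Hypothetical parser output with two page separators.
sample = "# Title\n\nFirst page.\n---\nSecond page.\n---\nThird page."
pages = split_pages_by_separator(sample)
for number, text in pages:
    print(f"Guessed page {number}: {text!r}")
```

The weak point is that a page containing its own Markdown horizontal rule would be split in two, shifting every later page number.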

I'm not part of Llama Index, so I can only tell you the approach we've taken: use the JSON output generated by get_json_result(). This splits the text by page and creates a structure with both the Markdown text and the page number. The following works with GPT-4o mode enabled.

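    import os

    import llama_parse

    # Assumption: both keys are read from environment variables.
    LLAMAPARSE_API_KEY = os.environ["LLAMA_CLOUD_API_KEY"]
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

    parser = llama_parse.LlamaParse(
        result_type='markdown',
        api_key=LLAMAPARSE_API_KEY,
        gpt4o_mode=True,
        gpt4o_api_key=OPENAI_API_KEY,
        ignore_errors=False,
    )
    # get_json_result() returns one entry per input file; take the first file's pages.
    result = parser.get_json_result('<file.pdf>')[0]['pages']
    n = len(result)
    print(f"Pages: {n}")
    for page in result:
        # Each page dict carries the Markdown text ("md") and its page number ("page").
        print(page["md"])
        print(f"Page {page['page']} of {n}")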

Inside the for loop, you'd merge the page number and text according to your needs.
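As one possible way to do that merge, here is a minimal sketch that prefixes each page's Markdown with a page marker, so the number stays with the text when it is chunked later. It only assumes that each page dict has the "md" and "page" fields used above. The helper name `tag_pages`, the marker format, and the sample data are my own choices for illustration.

```python
def tag_pages(pages: list[dict]) -> str:
    """Join per-page Markdown into one string, labelling each page.

    Assumption: every dict has a "page" number and an "md" string, as in the
    JSON result above. The "[Page X of N]" marker is an arbitrary format.
    """
    total = len(pages)
    tagged = []
    for page in pages:
        marker = f"[Page {page['page']} of {total}]"
        tagged.append(f"{marker}\n{page['md'].strip()}")
    return "\n\n".join(tagged)


# Hypothetical stand-in for parser.get_json_result(...)[0]['pages'].
sample_pages = [
    {"page": 1, "md": "# Intro\n"},
    {"page": 2, "md": "More text."},
]
combined = tag_pages(sample_pages)
print(combined)
```

If your chunker can split a page into several chunks, you'd need to repeat the marker per chunk or store the page number as chunk metadata instead.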

galvangoh commented 1 week ago

@adreichert thank you for sharing your solution. I knew about the get_json_result() feature, which returns both the page number and the md results. However, I am using AzStorageBlobReader, which does not support that (sorry, I should have mentioned this when opening the issue).

adreichert commented 1 week ago

Thank you for providing more information.

This repo is a wrapper for the REST API. It might benefit you to raise the request in the run-llama/llama_index repo, which seems to contain the bulk of the parsing code. That is where the AzStorageBlobReader class is defined (link).

Sorry I cannot be of more help. Good luck.

galvangoh commented 1 week ago

@adreichert thank you for your suggestions.