galvangoh opened this issue 1 week ago
I'm assuming that you want to have both the text of each page in Markdown and the corresponding page number in memory at the same time, and you don't want to guess the page number after splitting on `\n---\n`.

I'm not part of LlamaIndex, so I can only tell you the approach we've taken: use the JSON output generated by `get_json_result()`. This splits the text by page and creates a structure containing both the Markdown text and the page number. The following works with GPT mode enabled.
```python
import llama_parse

parser = llama_parse.LlamaParse(
    result_type='markdown',
    api_key=LLAMAPARSE_API_KEY,
    gpt4o_mode=True,
    gpt4o_api_key=OPENAI_API_KEY,
    ignore_errors=False,
)

result = parser.get_json_result('<file.pdf>')[0]['pages']
n = len(result)
print(f"Pages: {n}")
for page in result:
    # Use the "md" and "page" fields
    print(page["md"])
    print(f"Page {page['page']} of {n}")
```
Inside the `for` loop you'd merge the page number and text according to your needs.
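For instance, a minimal sketch of that merge (assuming you want one text blob per page with the page number prepended; `merge_pages` is a made-up helper name, and `pages` stands in for the `result` list above):

```python
def merge_pages(pages):
    """Combine each page's Markdown with its page number.

    `pages` is a list of dicts shaped like LlamaParse's JSON output,
    e.g. [{"page": 1, "md": "..."}, ...].
    """
    n = len(pages)
    merged = []
    for page in pages:
        # Prepend a header so downstream chunks keep their page number.
        merged.append(f"[Page {page['page']} of {n}]\n{page['md']}")
    return merged

chunks = merge_pages([{"page": 1, "md": "Intro"}, {"page": 2, "md": "Body"}])
# chunks[0] == "[Page 1 of 2]\nIntro"
```

Prepending the page number directly into the text (rather than keeping it only as metadata) means it survives any later chunking step.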
@adreichert thank you for sharing your solution. I knew about the `get_json_result()` feature, which returns both the page number and the Markdown results. However, I am using `AzStorageBlobReader`, which does not support that (sorry, I should have also mentioned this when opening the issue).
Thank you for providing more information.
This repo is essentially a wrapper around the REST API. It might benefit you to raise the request in the run-llama/llama_index repo, which seems to contain the bulk of the parsing code and is where the `AzStorageBlobReader` class is defined (link).
Sorry I cannot be of more help. Good luck.
@adreichert thank you for your suggestions.
Hello LlamaParse team,
Is it possible for LlamaParse to also return page numbers by default?
What I realised after some testing is that the parser treats each page of a document separately. The results already include a page separator, but if the Markdown output is chunked, indexed, and queried by an LLM, the LLM is unable to work out which page of the document a given chunk came from.
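As a stopgap (the earlier comment rightly notes that guessing page numbers from the separator is fragile, but it can work when the separator is intact), one option is to split the Markdown on the page separator yourself and tag each chunk with its page number before indexing. A sketch, assuming the `\n---\n` separator mentioned above; `split_by_page` and the sample text are made up for illustration:

```python
PAGE_SEPARATOR = "\n---\n"  # the page-break string assumed in this thread

def split_by_page(markdown_text):
    """Split parsed Markdown on the page separator and tag each
    chunk with the 1-based page number it came from."""
    pages = markdown_text.split(PAGE_SEPARATOR)
    return [
        {"page": i, "text": text}
        for i, text in enumerate(pages, start=1)
    ]

doc = "First page\n---\nSecond page"
print(split_by_page(doc))
# → [{'page': 1, 'text': 'First page'}, {'page': 2, 'text': 'Second page'}]
```

The page number can then be stored as chunk metadata (or prepended to the chunk text) so the LLM can cite the correct page at query time.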