pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0
297 stars 57 forks

Some PDF pages take a long time to convert. #157

Closed: imran-pyflow closed this issue 12 hours ago

imran-pyflow commented 17 hours ago

FinalCLStudy.md: page 59 takes more than 300 seconds to finish.

See the code below, from the function get_page_output in pymupdf_rag.py.

            for i in range(len(img_info) - 1, 0, -1):
                r = img_info[i]["bbox"]
                if r.is_empty:
                    img_info.pop(i)
                    continue
                for j in range(i):  # image areas larger than r
                    if r in img_info[j]["bbox"]:
                        #del img_info[i]  # contained in some larger image
                        img_info.pop(i)
                        break
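
For readers following along, the loop above can be reproduced in isolation. The sketch below is plain Python: `Box` and `prune_contained` are hypothetical names, with `Box` standing in for `pymupdf.Rect`, and it assumes the list is already sorted from largest to smallest area, as the comment "image areas larger than r" suggests. It is not the library's code, only an illustration of what the pruning does and why its worst-case cost grows quadratically with the number of image boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for pymupdf.Rect (x0, y0, x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def is_empty(self) -> bool:
        # A box with no area counts as empty.
        return self.x0 >= self.x1 or self.y0 >= self.y1

    def __contains__(self, other: "Box") -> bool:
        # True if `other` lies completely inside this box.
        return (
            self.x0 <= other.x0 and self.y0 <= other.y0
            and self.x1 >= other.x1 and self.y1 >= other.y1
        )


def prune_contained(img_info: list[dict]) -> list[dict]:
    """Drop empty boxes and boxes contained in an earlier (larger) box.

    Mirrors the quoted loop: index 0 is never examined, and the inner
    scan makes the worst case O(n^2) in the number of boxes.
    """
    img_info = list(img_info)  # work on a copy, leave the input untouched
    for i in range(len(img_info) - 1, 0, -1):
        r = img_info[i]["bbox"]
        if r.is_empty:
            img_info.pop(i)
            continue
        for j in range(i):  # assumed: earlier entries are at least as large
            if r in img_info[j]["bbox"]:
                img_info.pop(i)
                break
    return img_info


boxes = [
    {"bbox": Box(0, 0, 100, 100)},      # largest box
    {"bbox": Box(10, 10, 50, 50)},      # inside the first one
    {"bbox": Box(200, 200, 220, 220)},  # separate from the others
    {"bbox": Box(5, 5, 5, 5)},          # empty
]
kept = prune_contained(boxes)
```

Here `kept` retains only the first and third boxes. The loop itself touches image boxes only, so on a page with few images it should be cheap.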

imran-pyflow commented 16 hours ago

It computes image information even if write_images = False.

JorjMcKie commented 12 hours ago

If you want to report a bug, always provide a file that reproduces it. Apart from that, image information is needed in any case to compute the candidate text areas. The duration depends mainly on the number of vector graphics on the page.
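
To narrow a report like this down, it can help to time each page separately before attaching the file. The helper below is a minimal sketch using only the standard library; `time_pages` and its `convert` argument are hypothetical names, not part of PyMuPDF.

```python
import time
from typing import Callable, Iterable


def time_pages(
    convert: Callable[[int], object],
    page_numbers: Iterable[int],
) -> list[tuple[int, float]]:
    """Call convert(pno) for each page; return (pno, seconds), slowest first."""
    timings = []
    for pno in page_numbers:
        start = time.perf_counter()
        convert(pno)  # the result is discarded, only the duration matters
        timings.append((pno, time.perf_counter() - start))
    return sorted(timings, key=lambda t: t[1], reverse=True)
```

With pymupdf4llm installed, `convert` could be something like `lambda pno: pymupdf4llm.to_markdown("input.pdf", pages=[pno])`, where `input.pdf` is a placeholder and the page numbers in `pages` are 0-based. The slowest entries point to the pages worth including in a reproducing file.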