pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://jorjmckie.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Many text blocks became images since version 0.0.3 #23

Closed papipsycho closed 2 months ago

papipsycho commented 2 months ago

Hello,

I just updated from version 0.0.1 to version 0.0.3, and now a lot of text is detected as images.

Screenshot 2024-05-23 at 18 05 26

For instance, the text inside the gray box here is now detected as an image:

############## 1.2.3 Exclusions

![TA_3_062002_2.pdf-13-0.png](TA_3_062002_2.pdf-13-0.png)

I tried setting write_images=False, but it doesn't change anything.

JorjMcKie commented 2 months ago

This happens because of the gray background color. Images and vector graphics are both treated as images, and the gray background is implemented as a vector graphic. This is a known issue: we need to differentiate between simple background coloring and actual vector graphics.
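
One way to picture the distinction between background coloring and actual vector graphics is a small heuristic over drawing paths. The following is only a sketch and not the library's implementation: it assumes each path is a dict shaped like the entries returned by PyMuPDF's `page.get_drawings()` (keys `"type"`, `"fill"`, `"items"`), and the function names `is_background_fill` and `split_drawings` are hypothetical.

```python
def is_background_fill(path):
    """Return True if a drawing path looks like plain background coloring.

    Assumption: `path` is a dict shaped like an entry of PyMuPDF's
    page.get_drawings(). "type" is "f" (fill), "s" (stroke) or "fs", and
    "items" holds tuples whose first element names the draw command
    ("re" = rectangle, "l" = line, "c" = curve, "qu" = quad).
    """
    if path.get("type") != "f":      # anything stroked counts as a real graphic
        return False
    if path.get("fill") is None:     # no fill color, so nothing is colored in
        return False
    items = path.get("items", [])
    # A background consists of rectangles only, with no lines or curves.
    return len(items) > 0 and all(item[0] == "re" for item in items)


def split_drawings(paths):
    """Split paths into (background fills, actual graphics)."""
    background = [p for p in paths if is_background_fill(p)]
    graphics = [p for p in paths if not is_background_fill(p)]
    return background, graphics
```

Text lying on top of paths in the first list could then be extracted as ordinary text, while only the second list would be considered for image treatment.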

JorjMcKie commented 2 months ago

Can you let us have the example file so we can test the improvement?

papipsycho commented 2 months ago

Hello,

Thanks for your answers.

Untitled.pdf

Here I created a one-page PDF that shows the issue.

JorjMcKie commented 2 months ago

Thank you. I will look into it and share an improvement with you here.

papipsycho commented 2 months ago

By the way, I realized you migrated the pdf4llm code.

So, just to give you more information: it works with version 0.0.1 of pymupdf4llm and with version 0.0.7 of pdf4llm.

JorjMcKie commented 2 months ago

I have a fix for the problem (not published yet): With write_images=False, no image references will be generated in the MD text, and text written on top of images and unicolor vector graphics (i.e. background coloring) will be output. BTW, in your example file, the large gray area is an image!
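
To check whether a given installation already behaves this way, the Markdown output can be scanned for image references. This is a minimal sketch: `find_image_refs` and `pdf_to_text_only_markdown` are hypothetical helper names, and the regular expression only covers the simple `![alt](target)` form shown earlier in this thread.

```python
import re

# Matches simple Markdown image references such as ![alt](file.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def find_image_refs(md_text):
    """Return all Markdown image references found in md_text."""
    return IMAGE_REF.findall(md_text)


def pdf_to_text_only_markdown(pdf_path):
    """Convert a PDF to Markdown with write_images=False and report leftovers."""
    # Imported lazily so find_image_refs() works without pymupdf4llm installed.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path, write_images=False)
    leftover = find_image_refs(md_text)
    if leftover:
        print(f"warning: {len(leftover)} image reference(s) still present")
    return md_text
```

Calling `pdf_to_text_only_markdown("Untitled.pdf")` on an installation that includes the fix should print no warning.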

jakubkovac commented 2 months ago

I'm also trying to figure out a fix for this. @JorjMcKie, how would you reason about the following PDF: testing_vector_graphics.pdf? It was created in MS Word by pasting text from a lorem ipsum generator, highlighting some parts of the text within Word, and adding one raster image and one SVG image that contains text. My goal is to extract Markdown from the document that contains all of the text as well as the raster image. Ideally, the vector graphics would get converted to raster. At the moment, the content is wrongly being treated as a table, which I can fix by modifying the code of to_markdown to pass additional arguments, such as strategy, to the function find_tables. However, the most important thing for me is to figure out the distinction between actual vector graphics and what Word produces that merely looks like vector graphics: the background rectangles.

JorjMcKie commented 2 months ago

@jakubkovac Your complete text has some background color, be it white or something else. This makes it impossible for the graphics clustering to separate out that little circle and friends. Instead, there is one big graphics rectangle covering everything. Table detection can use strategy="lines_strict" without any actual detection loss. So in the end, this is the maximum that can be achieved: [image]
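
Why a page-wide background swallows every smaller graphic can be illustrated with a toy clustering routine. This is a sketch of the general idea only, not PyMuPDF's clustering code: rectangles are plain `(x0, y0, x1, y1)` tuples, and the function names and the tolerance of 3 points are assumptions made for the example.

```python
def rects_touch(a, b, tol=3.0):
    """True if rectangles a and b overlap or lie within tol of each other."""
    return (a[0] - tol <= b[2] and b[0] - tol <= a[2]
            and a[1] - tol <= b[3] and b[1] - tol <= a[3])


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects, tol=3.0):
    """Merge touching rectangles repeatedly until no further merge is possible."""
    clusters = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if rects_touch(clusters[i], clusters[j], tol):
                    clusters[i] = union(clusters[i], clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Two distant graphics stay separate, but adding one rectangle that spans the page makes every graphic touch it, so the result collapses into a single page-sized cluster.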

papipsycho commented 2 months ago

"I have a fix for the problem (not published yet): If setting write_images=False no image references in the MD text will be generated and text written upon images and unicolor vector graphics (i.e. background coloring) will be output. BTW in your example file, the large gray area is an image!"

Yes, I cannot really provide the original PDF, so I extracted the page, converted it to Word to be able to modify the data, then saved it as a PDF again.

"I have a fix for the problem (not published yet): If setting write_images=False": do you mean this should already be the behavior in version 0.0.3, or only with your fix?

JorjMcKie commented 2 months ago

The fix will be present in the next version 0.0.4. Your example as md will look like this (write_images=False): Untitled.pdf.md

JorjMcKie commented 2 months ago

If you want to try this out, here is the changed file: pymupdf_rag.zip. Put it in the folder site-packages/pymupdf4llm/helpers, replacing what is there.
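
The exact location of site-packages differs between environments, so it can help to ask Python where the package lives. A small sketch using only the standard library; `package_subfolder` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def package_subfolder(package, subfolder):
    """Return the path of `subfolder` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package not installed (or has no regular location)
    return Path(spec.origin).parent / subfolder


target = package_subfolder("pymupdf4llm", "helpers")
print(target if target else "pymupdf4llm is not installed in this environment")
```

The printed folder is where the replacement file goes. Keeping a backup copy of the original file first makes it easy to revert.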

papipsycho commented 2 months ago

Thank you for the fix, it is all good now.

I noticed you left some print statements in. Here is the output:

before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=2, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=3, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=2, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=3, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=1, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
before: len(vg_clusters)=0
len(tab_rects0)=0, len(vg_clusters0)=0
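
Until a cleaned-up release is installed, leftover debug output like the above can be kept out of the console by capturing stdout around the call. A minimal sketch using only the standard library; `call_quietly` is a hypothetical helper name, and it only captures text written through Python's `sys.stdout`.

```python
import contextlib
import io


def call_quietly(func, *args, **kwargs):
    """Call func and capture anything it prints to stdout.

    Returns a (result, captured_text) tuple.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()
```

For example, `md_text, _ = call_quietly(pymupdf4llm.to_markdown, "Untitled.pdf", write_images=False)` would return the Markdown while discarding the debug lines.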

JorjMcKie commented 2 months ago

Fixed in version 0.0.4.