infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: When the layout model recognizes boxes with overlapping parts, it can lead to a large number of pdf text loss #1328

Open papandadj opened 5 days ago

papandadj commented 5 days ago

Is there an existing issue for the same bug?

Branch name

any

Commit ID

any

Other environment information

No response

Actual behavior

Recently, I discovered a significant data loss issue when parsing PDF files. After debugging the related files, I found that when the layout model recognizes many boxes, and two boxes a and b overlap so that one contains the other and the pair does not meet the condition

if Recognizer.overlapped_area(layouts[i], layouts[j]) < thr \
       and Recognizer.overlapped_area(layouts[j], layouts[i]) < thr:

it leads to a large amount of data being deleted.
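To make the condition concrete, here is a minimal sketch of how it evaluates for a nested pair of boxes. It is not the project's code: the `overlapped_area` helper below is a stand-in that assumes the function returns the fraction of the first box's area covered by the second, and the box keys (`x0`, `x1`, `top`, `bottom`) and `thr=0.7` are assumptions chosen for illustration.

```python
def overlapped_area(a, b):
    """Fraction of box a's area that is covered by box b (assumed semantics)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0:
        return 0.0  # the boxes do not intersect
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    return (w * h) / area_a if area_a > 0 else 0.0


def pair_is_skipped(a, b, thr=0.7):
    """True when the quoted condition holds, so neither box is touched."""
    return overlapped_area(a, b) < thr and overlapped_area(b, a) < thr


# Hypothetical boxes: a small box lying fully inside a large one.
large = {"x0": 0, "x1": 100, "top": 0, "bottom": 100}
small = {"x0": 10, "x1": 50, "top": 10, "bottom": 30}

ratio_small = overlapped_area(small, large)  # small is entirely covered
ratio_large = overlapped_area(large, small)  # only a sliver of large is covered
skipped = pair_is_skipped(large, small)

print(ratio_small, ratio_large, skipped)
```

Under these assumptions the small box's ratio is 1.0, so the "both below the threshold" condition fails even though the large box is barely covered, and the pair moves on to the step where one of the two boxes is removed.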

For example, the layout model recognizes the following boxes:

image

The two red boxes overlap, but they do not meet the above condition, which will result in the box with the lower score being deleted.

image

I found that other people seem to have encountered this problem as well, but I’m not sure if it’s the same issue. https://github.com/infiniflow/ragflow/issues/1057

The overlap ratio of the small red box shown above is very close to 1.

Would it be better to change the less-than sign to a greater-than sign here? That is to say, if the overlapping part occupies the vast majority of both boxes’ areas, then one of them needs to be deleted.

Like this:

  if Recognizer.overlapped_area(layouts[i], layouts[j]) > thr \
         or Recognizer.overlapped_area(layouts[j], layouts[i]) > thr:
      # Delete the box with the smaller area (pseudocode)
  else:
      i += 1
      continue

But would this cause a lot of data duplication? I’m not sure if the layout model filters out such cases.
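The proposal above can be sketched as a small standalone function. This is only an illustration under assumptions, not a patch: it compares adjacent boxes only (the real loop picks `j` in code not quoted here), `overlapped_area` and `area` are hypothetical helpers, and the box keys and `thr=0.7` are made up for the example.

```python
def area(box):
    """Area of a box given as x0/x1/top/bottom (assumed keys)."""
    return (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])


def overlapped_area(a, b):
    """Fraction of box a's area that is covered by box b (assumed semantics)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0 or area(a) <= 0:
        return 0.0
    return (w * h) / area(a)


def dedupe_by_area(layouts, thr=0.7):
    """Drop the smaller box of each adjacent pair that overlaps heavily."""
    layouts = list(layouts)  # work on a copy, leave the input untouched
    i = 0
    while i + 1 < len(layouts):
        a, b = layouts[i], layouts[i + 1]
        if overlapped_area(a, b) > thr or overlapped_area(b, a) > thr:
            # Remove whichever box of the pair has the smaller area,
            # then compare the survivor against the next box.
            layouts.pop(i if area(a) < area(b) else i + 1)
        else:
            i += 1
    return layouts


# Hypothetical input: a large box, a small box nested in it, and a separate box.
large = {"x0": 0, "x1": 100, "top": 0, "bottom": 100, "score": 0.6}
small = {"x0": 10, "x1": 50, "top": 10, "bottom": 30, "score": 0.9}
other = {"x0": 0, "x1": 100, "top": 120, "bottom": 200, "score": 0.8}

result = dedupe_by_area([large, small, other])
print(len(result))
```

With this input the nested small box is removed and the large box survives regardless of score. Whether that is the right trade-off for real layouts is exactly the open question raised above.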

Expected behavior

No response

Steps to reproduce

null

Additional information

No response

cyhasuka commented 5 days ago

I've got the same issue.

awesomeboy2 commented 5 days ago

Same issue.

KevinHuSh commented 4 days ago

It will not lead to the loss of text boxes. Text boxes are removed only when they are identified as reference/header/footer.

papandadj commented 4 days ago

I don’t think that is the case. In the code, when two boxes overlap, with a large box covering a small box, and the score of the small box is higher than that of the large box, the large box is deleted.
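The behavior described here can be reproduced in isolation with a short sketch. It assumes, based only on the description in this thread, that a heavily overlapping pair is resolved by dropping the box with the lower score. The helper names, box keys, scores, and `thr=0.7` are all hypothetical.

```python
def overlapped_area(a, b):
    """Fraction of box a's area that is covered by box b (assumed semantics)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    if w <= 0 or h <= 0 or area_a <= 0:
        return 0.0
    return (w * h) / area_a


def resolve_by_score(a, b, thr=0.7):
    """Return the surviving boxes when the lower-score box of a pair is dropped."""
    if overlapped_area(a, b) < thr and overlapped_area(b, a) < thr:
        return [a, b]  # weak overlap: both boxes are kept
    return [a] if a["score"] > b["score"] else [b]


# Hypothetical boxes: the small one is nested in the large one and scores higher.
large = {"name": "large", "x0": 0, "x1": 100, "top": 0, "bottom": 100, "score": 0.6}
small = {"name": "small", "x0": 10, "x1": 50, "top": 10, "bottom": 30, "score": 0.9}

survivors = resolve_by_score(large, small)
names = [box["name"] for box in survivors]
print(names)
```

In this sketch only the small box survives, so any text that was covered by the large box alone no longer belongs to a layout region.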

test2.pdf

image

Could you debug this file and set a breakpoint at the line marked with a red line to test it?