Open papandadj opened 5 days ago
I've got the same issues.
same issues
It will not lead to the loss of text boxes. Text boxes are removed only when they are identified as reference/header/footer.
I don't think that's the case. In the code, when two boxes overlap and a large box covers a small box, the large box is deleted if the small box has a higher score than the large box.
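To make the scenario concrete, here is a minimal sketch of the behavior I am describing. It is not the actual project code: the box layout (`x0`, `x1`, `top`, `bottom`, `score`), the helper names, and the `0.7` threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical box layout: x0, x1, top, bottom, score.
Box = Dict[str, float]


def area(b: Box) -> float:
    """Area of an axis-aligned box (0 if the box is degenerate)."""
    return max(0.0, b["x1"] - b["x0"]) * max(0.0, b["bottom"] - b["top"])


def overlap_ratio(a: Box, b: Box) -> float:
    """Fraction of box a's area that lies inside box b."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0 or area(a) == 0:
        return 0.0
    return (w * h) / area(a)


def drop_lower_score(a: Box, b: Box, thr: float = 0.7) -> List[Box]:
    """Keep both boxes only if neither is mostly covered by the other.
    Otherwise keep just the higher-scoring one (assumed behavior)."""
    if overlap_ratio(a, b) < thr and overlap_ratio(b, a) < thr:
        return [a, b]
    return [a] if a["score"] > b["score"] else [b]


# A large box fully containing a small box that has the higher score.
large = {"x0": 0, "x1": 100, "top": 0, "bottom": 100, "score": 0.80}
small = {"x0": 10, "x1": 40, "top": 10, "bottom": 30, "score": 0.95}

kept = drop_lower_score(large, small)
print(kept)  # only the small box survives; the large box is dropped
```

Under these assumptions the small box is entirely inside the large one, so one of the two ratios reaches the threshold and the lower-scoring large box is removed, along with everything else it covered.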
Could you debug this file and set a breakpoint at the line marked with a red line to test it?
Is there an existing issue for the same bug?
Branch name
any
Commit ID
any
Other environment information
No response
Actual behavior
Recently, I discovered a significant data loss issue when parsing PDF files. After debugging the related files, I found that when the layout recognizer detects many boxes, and boxes a and b are in a containment relationship but do not meet the condition
a large amount of data is deleted.
For example, the layout recognition result is:
The two red boxes overlap, but they do not meet the above condition, which will result in the box with the lower score being deleted.
I found that other people seem to have encountered this problem as well, but I’m not sure if it’s the same issue. https://github.com/infiniflow/ragflow/issues/1057
The overlap ratio of the small red boxes above is very close to 1.
Would it be better to change the less-than sign to a greater-than sign here? That is to say, if the overlapping part occupies the vast majority of both boxes' areas, then one of them needs to be deleted.
like this:
But would this cause a lot of data duplication? I'm not sure whether the layout model filters out such cases.
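As a rough sketch of the rule I have in mind (delete one box only when the overlap covers most of both boxes), something like the following could work. The function names, the box keys, and the `0.7` threshold are hypothetical and are not taken from the project code.

```python
from typing import Dict, List

# Hypothetical box layout: x0, x1, top, bottom, score.
Box = Dict[str, float]


def box_area(b: Box) -> float:
    """Area of an axis-aligned box."""
    return max(0.0, b["x1"] - b["x0"]) * max(0.0, b["bottom"] - b["top"])


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by boxes a and b (0 if they do not overlap)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return w * h if w > 0 and h > 0 else 0.0


def mostly_same_region(a: Box, b: Box, thr: float = 0.7) -> bool:
    """True only if the overlap covers more than thr of BOTH boxes."""
    inter = intersection_area(a, b)
    if inter == 0.0:
        return False
    return inter / box_area(a) > thr and inter / box_area(b) > thr


def dedupe_pair(a: Box, b: Box, thr: float = 0.7) -> List[Box]:
    """Drop the lower-scoring box only for near-duplicate boxes."""
    if mostly_same_region(a, b, thr):
        return [a] if a["score"] > b["score"] else [b]
    return [a, b]


large = {"x0": 0, "x1": 100, "top": 0, "bottom": 100, "score": 0.80}
small = {"x0": 10, "x1": 40, "top": 10, "bottom": 30, "score": 0.95}
near_dup = {"x0": 2, "x1": 100, "top": 0, "bottom": 98, "score": 0.60}

print(len(dedupe_pair(large, small)))     # containment: both boxes kept
print(len(dedupe_pair(large, near_dup)))  # near-duplicate: one box kept
```

With this rule the large box is no longer deleted in the containment case, but the small box inside it is kept as well, so any text in that region could end up in both boxes. That is the duplication I am worried about.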
Expected behavior
No response
Steps to reproduce
Additional information
No response