api:GetPageTextStructured The extracted information has repeat areas

FortuneW commented 8 months ago

Hi,

I tested the GetPageTextStructured function and found that it returns results with repeated (overlapping) areas.

Here is a specific example of the error:

  {
   "left": 36.88567352294922,
   "top": 215.13877868652344,
   "right": 87.23951721191406,
   "bottom": 207.6760711669922,
   "text": "hibernators ("
  },
  {
   "left": 87.57408905029297,
   "top": 214.72274780273438,
   "right": 92.2285385131836,
   "bottom": 208.967529296875,
   "text": "5)"
  },
  {
   "left": 92.16873168945312,
   "top": 215.13877868652344,
   "right": 96.52155303955078,
   "bottom": 207.6760711669922,
   "text": "5)."
  },

The original text is shown in the attached screenshot (wrong-pos).

FortuneW commented 8 months ago

@jerbob92

jerbob92 commented 8 months ago

@FortuneW I think you have to go to pdfium to get this fixed. The way this code works is:

Use FPDFText_CountRects to get the number of rects in the page
Use FPDFText_GetRect for each rect to get the coordinates of that rect
Use FPDFText_GetBoundedText to get the text within that rect
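The three steps above can be sketched in Go roughly as follows. This is only an illustration, not the go-pdfium implementation: the TextPage interface, its method signatures, and the fakePage type are hypothetical, the glyph extents are made up, and the rule that a rect returns every glyph it intersects is an assumption chosen to mimic the output reported in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// Rect is a text rectangle in PDF page coordinates.
type Rect struct{ Left, Top, Right, Bottom float64 }

// TextPage is a hypothetical stand-in for a loaded text page. The methods are
// named after FPDFText_CountRects, FPDFText_GetRect and
// FPDFText_GetBoundedText; the Go signatures are assumptions.
type TextPage interface {
	CountRects() int
	GetRect(i int) Rect
	GetBoundedText(r Rect) string
}

// StructuredText pairs a rect with the text read back from it.
type StructuredText struct {
	Rect Rect
	Text string
}

// extractStructured performs the three steps in order.
func extractStructured(p TextPage) []StructuredText {
	n := p.CountRects() // step 1: number of rects
	out := make([]StructuredText, 0, n)
	for i := 0; i < n; i++ {
		r := p.GetRect(i)                                                     // step 2: coordinates
		out = append(out, StructuredText{Rect: r, Text: p.GetBoundedText(r)}) // step 3: text in rect
	}
	return out
}

// glyph is a piece of text with a made-up horizontal extent.
type glyph struct {
	left, right float64
	text        string
}

// fakePage is an in-memory TextPage used only for this demonstration.
type fakePage struct {
	rects  []Rect
	glyphs []glyph
}

func (f fakePage) CountRects() int    { return len(f.rects) }
func (f fakePage) GetRect(i int) Rect { return f.rects[i] }

// GetBoundedText returns every glyph that intersects r horizontally. This
// intersection rule is an assumption made to mimic the reported output.
func (f fakePage) GetBoundedText(r Rect) string {
	var sb strings.Builder
	for _, g := range f.glyphs {
		if g.left < r.Right && g.right > r.Left {
			sb.WriteString(g.text)
		}
	}
	return sb.String()
}

// demoPage uses rects rounded from the issue and invented glyph extents.
func demoPage() fakePage {
	return fakePage{
		rects: []Rect{
			{36.89, 215.14, 87.24, 207.68},
			{87.57, 214.72, 92.23, 208.97},
			{92.17, 215.14, 96.52, 207.68},
		},
		glyphs: []glyph{
			{36.89, 87.24, "hibernators ("},
			{87.57, 92.23, "5"}, // glyph box that reaches slightly past the start of ")"
			{92.17, 94.40, ")"},
			{94.40, 96.52, "."},
		},
	}
}

func main() {
	for _, st := range extractStructured(demoPage()) {
		fmt.Printf("%q\n", st.Text)
	}
}
```

Because the text is looked up by coordinates in step 3, any rect that touches a neighboring glyph picks that glyph up as well, which is how the same character can end up in two results.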

I suspect that pdfium has split up the rects for that sentence because of a mix of font styles; it looks like the 5 is in italic. Because the 5 is italic and tilted to the right, its bounding box overlaps with the neighboring text. If you look closely, you can see that the 5 extends a bit over the ( and the ). The actual rects should be: "hibernators (", "5" and ").".
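The overlap can be checked directly from the coordinates reported at the top of this issue. The following is a small sketch; the Rect type and the helper names are made up for illustration, and only the numbers come from the reported output.

```go
package main

import (
	"fmt"
	"math"
)

// Rect mirrors one entry of the reported output.
type Rect struct {
	Left, Top, Right, Bottom float64
	Text                     string
}

// horizontalOverlap returns the width shared by a and b on the x axis,
// or 0 when the two rects do not overlap horizontally.
func horizontalOverlap(a, b Rect) float64 {
	return math.Max(0, math.Min(a.Right, b.Right)-math.Max(a.Left, b.Left))
}

// reportedRects returns the three rects exactly as posted in the issue.
func reportedRects() []Rect {
	return []Rect{
		{36.88567352294922, 215.13877868652344, 87.23951721191406, 207.6760711669922, "hibernators ("},
		{87.57408905029297, 214.72274780273438, 92.2285385131836, 208.967529296875, "5)"},
		{92.16873168945312, 215.13877868652344, 96.52155303955078, 207.6760711669922, "5)."},
	}
}

func main() {
	rects := reportedRects()
	// Compare each rect with its right-hand neighbor.
	for i := 0; i+1 < len(rects); i++ {
		a, b := rects[i], rects[i+1]
		fmt.Printf("%q vs %q: overlap %.4f\n", a.Text, b.Text, horizontalOverlap(a, b))
	}
}
```

The first and second rects do not overlap, while the second rect ends at about 92.23 and the third starts at about 92.17, so they share a sliver of roughly 0.06 units. That is consistent with the explanation above.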

In my opinion this could be fixed in 3 ways:

pdfium should not split the rects based on the style and this should just be one rect with the text hibernators (5)., or:
pdfium should return a proper rect (so not a straight rectangle but one that is also a bit tilted), or:
pdfium should provide a method to directly get the text of a rect (so without the coordinate step in between)

FortuneW commented 8 months ago

@jerbob92 Got it, thank you for your analysis and suggestions. I will try some methods to fix such problems.

FortuneW commented 8 months ago

Just to keep you in sync:

I have rewritten a function similar to GetPageTextStructured using the following functions:

FPDFPage_CountObjects
FPDFPage_GetObject
FPDFPageObj_GetType
FPDFFormObj_CountObjects
FPDFFormObj_GetObject
FPDFTextObj_GetText
FPDFPageObj_GetBounds

So far, it works well.
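A rough sketch of how such an object walk might be structured is shown below. It is not the code described above: the PageObject interface and the fakeObj type are hypothetical wrappers around the listed pdfium functions, and the sample data is made up. The sketch also does not address whether the bounds of objects nested inside a form need further transformation.

```go
package main

import "fmt"

// Page object type values as defined in pdfium's fpdf_edit.h.
const (
	objText = 1 // FPDF_PAGEOBJ_TEXT
	objForm = 5 // FPDF_PAGEOBJ_FORM
)

// Bounds is an object's bounding box in PDF coordinates.
type Bounds struct{ Left, Bottom, Right, Top float64 }

// PageObject is a hypothetical wrapper; each method names the pdfium
// function it would be backed by.
type PageObject interface {
	Type() int              // FPDFPageObj_GetType
	Bounds() Bounds         // FPDFPageObj_GetBounds
	Text() string           // FPDFTextObj_GetText (text objects only)
	Children() []PageObject // FPDFFormObj_CountObjects + FPDFFormObj_GetObject
}

// TextItem is one extracted text object with its bounds.
type TextItem struct {
	Bounds Bounds
	Text   string
}

// collectText walks the objects, descending into form objects, and returns
// the text objects in document order. Other object types are skipped.
func collectText(objs []PageObject) []TextItem {
	var out []TextItem
	for _, o := range objs {
		switch o.Type() {
		case objText:
			out = append(out, TextItem{Bounds: o.Bounds(), Text: o.Text()})
		case objForm:
			out = append(out, collectText(o.Children())...)
		}
	}
	return out
}

// fakeObj is an in-memory PageObject used only for this demonstration.
type fakeObj struct {
	kind int
	b    Bounds
	text string
	kids []PageObject
}

func (f fakeObj) Type() int              { return f.kind }
func (f fakeObj) Bounds() Bounds         { return f.b }
func (f fakeObj) Text() string           { return f.text }
func (f fakeObj) Children() []PageObject { return f.kids }

// samplePage returns made-up top-level objects, the equivalent of looping
// over FPDFPage_CountObjects and FPDFPage_GetObject.
func samplePage() []PageObject {
	return []PageObject{
		fakeObj{kind: objText, b: Bounds{36.9, 207.7, 87.2, 215.1}, text: "hibernators ("},
		fakeObj{kind: objForm, kids: []PageObject{
			fakeObj{kind: objText, b: Bounds{87.6, 209.0, 92.2, 214.7}, text: "5"},
		}},
		fakeObj{kind: objText, b: Bounds{92.2, 207.7, 96.5, 215.1}, text: ")."},
	}
}

func main() {
	for _, item := range collectText(samplePage()) {
		fmt.Printf("%q %+v\n", item.Text, item.Bounds)
	}
}
```

In this approach each text object carries its own text, so the text is never looked up again by coordinates and overlapping bounds cannot cause a character to be reported twice.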

jerbob92 commented 8 months ago

@FortuneW that's a nice solution!

klippa-app / go-pdfium

api:GetPageTextStructured The extracted information has repeat areas #145