klippa-app / go-pdfium

Easy to use PDF library using Go and PDFium
MIT License

api:GetPageTextStructured The extracted information has repeat areas #145

Closed FortuneW closed 8 months ago

FortuneW commented 8 months ago

Hi,

I tested the GetPageTextStructured function and found that it returns results with repeated, overlapping areas.

For example, see the following files:

0001.json 0001.pdf

Here is a specific example of the problem:

  {
   "left": 36.88567352294922,
   "top": 215.13877868652344,
   "right": 87.23951721191406,
   "bottom": 207.6760711669922,
   "text": "hibernators ("
  },
  {
   "left": 87.57408905029297,
   "top": 214.72274780273438,
   "right": 92.2285385131836,
   "bottom": 208.967529296875,
   "text": "5)"
  },
  {
   "left": 92.16873168945312,
   "top": 215.13877868652344,
   "right": 96.52155303955078,
   "bottom": 207.6760711669922,
   "text": "5)."
  },

The original text is: wrong-pos

FortuneW commented 8 months ago

@jerbob92

jerbob92 commented 8 months ago

@FortuneW I think you have to go to pdfium to get this fixed. The way this code works is:

I suspect that pdfium has split up the rects for that sentence because of a mix of font styles: it looks like the 5 is in italic. Because the 5 is in italic and tilted to the right, its bounding box overlaps with other text. If you look closely, you can see that the 5 falls a bit over the ( and the ). The actual rects should be: "hibernators (", "5" and ").".
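To make the overlap concrete, here is a minimal Go sketch that checks the three rects from the report against each other. It is not part of go-pdfium: the `Rect` type, the `reported` slice and the `overlaps` helper are hypothetical names chosen for illustration, and the sketch assumes the usual PDF page coordinates where `top` is larger than `bottom`.

```go
package main

import "fmt"

// Rect mirrors the fields of the JSON in the report.
// Hypothetical type for illustration, not go-pdfium's own.
type Rect struct {
	Left, Top, Right, Bottom float64
	Text                     string
}

// reported holds the three rects quoted in the issue.
var reported = []Rect{
	{Left: 36.88567352294922, Top: 215.13877868652344, Right: 87.23951721191406, Bottom: 207.6760711669922, Text: "hibernators ("},
	{Left: 87.57408905029297, Top: 214.72274780273438, Right: 92.2285385131836, Bottom: 208.967529296875, Text: "5)"},
	{Left: 92.16873168945312, Top: 215.13877868652344, Right: 96.52155303955078, Bottom: 207.6760711669922, Text: "5)."},
}

// overlaps reports whether a and b share any area.
// Assumes the origin is at the bottom left, so Top > Bottom.
func overlaps(a, b Rect) bool {
	horizontal := a.Left < b.Right && b.Left < a.Right
	vertical := a.Bottom < b.Top && b.Bottom < a.Top
	return horizontal && vertical
}

func main() {
	// Compare each rect with its right-hand neighbour.
	for i := 0; i+1 < len(reported); i++ {
		a, b := reported[i], reported[i+1]
		fmt.Printf("%q vs %q: overlap=%v\n", a.Text, b.Text, overlaps(a, b))
	}
}
```

With the values from the report, the first pair does not overlap, while the second and third rects do: the second rect ends at about 92.23 and the third starts at about 92.17. That small shared strip would be consistent with the same characters showing up in two results.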

In my opinion this could be fixed in 3 ways:

FortuneW commented 8 months ago

@jerbob92 Got it, thank you for your analysis and suggestions. I will try a few approaches to fix problems like this.

FortuneW commented 8 months ago

Just to follow up with you,

I have rewritten a function similar to GetPageTextStructured using the following function:

So far, it works well.

jerbob92 commented 8 months ago

@FortuneW that's a nice solution!