FortuneW closed 8 months ago
@jerbob92
@FortuneW I think you have to go to pdfium to get this fixed. The way this code works is:
FPDFText_CountRects to get the number of rects in the page
FPDFText_GetRect for each rect to get the coordinates of that rect
FPDFText_GetBoundedText to get the text within that rect
I suspect that pdfium has split up the rects for that sentence because of a mix in the font styles; it looks like the 5 is in italic. Because the 5 is in italic and tilted to the right, its bounding box overlaps with other text. If you look closely, you can see that the 5 falls a bit over ( and ). The actual rects should be: "hibernators (", "5" and ").".
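To illustrate why overlapping rects lead to repeated text, here is a minimal Go sketch. It does not call pdfium. `Box`, `Char`, `overlaps`, `boundedText` and `sampleChars` are hypothetical names, the coordinates are made up, and the rule "a character is returned when its box overlaps the query rect" is an assumption used as a toy stand-in for FPDFText_GetBoundedText, not a description of pdfium's actual behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Box is a horizontal extent in page coordinates (hypothetical type;
// only the x axis is modelled to keep the example small).
type Box struct{ Left, Right float64 }

// Char is a single glyph together with its bounding box.
type Char struct {
	Text string
	Box  Box
}

// overlaps reports whether two boxes share any horizontal extent.
func overlaps(a, b Box) bool {
	return a.Left < b.Right && b.Left < a.Right
}

// boundedText is a toy stand-in for FPDFText_GetBoundedText.
// Assumption: every char whose box overlaps the rect is returned.
func boundedText(chars []Char, rect Box) string {
	var sb strings.Builder
	for _, c := range chars {
		if overlaps(c.Box, rect) {
			sb.WriteString(c.Text)
		}
	}
	return sb.String()
}

// sampleChars models "(5)" where the italic 5 leans over both
// neighbours. The coordinates are invented for illustration.
func sampleChars() []Char {
	return []Char{
		{"(", Box{10, 14}},
		{"5", Box{13, 20}}, // italic glyph: box reaches into "(" and ")"
		{")", Box{19, 23}},
	}
}

func main() {
	chars := sampleChars()
	// One rect per style run, as in the split suspected above.
	rects := []Box{{10, 14}, {13, 20}, {19, 23}}
	for i, r := range rects {
		fmt.Printf("rect %d: %q\n", i, boundedText(chars, r))
	}
}
```

In this toy model each of the three rects picks up the 5, and the middle rect also picks up both parentheses, which is the kind of repetition reported in this issue.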
In my opinion this could be fixed in 3 ways:
hibernators (5).
, or:
@jerbob92 Got it, thank you for your analysis and suggestions. I will try some methods to fix such problems.
Nice to synchronize with you,
I have rewritten a function similar to GetPageTextStructured using the following function:
So far, it works well.
@FortuneW that's a nice solution!
Hi,
I tested and found that the GetPageTextStructured function returns results with repeated areas.
For example, the following:
0001.json 0001.pdf
Here is a specific error:
The original text is: