jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars 647 forks

No spaces extracted by chars #780

Closed mon-hur closed 1 year ago

mon-hur commented 1 year ago

Describe the bug

When I extract text via the chars property, space characters are not extracted. With the extract_text method, however, spaces are extracted correctly.

Code to reproduce the problem

pdf.pages[0].chars
pdf.pages[0].extract_text()

PDF file

arti_listoutof error.pdf

Expected behavior

Spaces should be extracted with chars too.

jsvine commented 1 year ago

Hi @mon-hur, and thanks for your interest in this library. What you've observed is typical of many PDFs: they often don't encode spaces as literal space characters, and .chars returns only the literal characters. Inferring the spacing between those characters is part of what .extract_text(...) does.
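To illustrate the idea, here is a minimal sketch of gap-based space inference for a single line of text. It is a simplified stand-in, not pdfplumber's actual implementation: the helper name `join_chars_with_spaces` and the sample data are hypothetical, and the char dicts only mimic the `text`, `x0`, and `x1` keys that `.chars` entries carry.

```python
def join_chars_with_spaces(chars, x_tolerance=3):
    """Rebuild one line of text from char dicts.

    A space is inserted wherever the horizontal gap between two
    neighbouring characters is wider than x_tolerance (assumed units:
    PDF points). Hypothetical helper, not part of pdfplumber.
    """
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for char in ordered:
        # A wide gap between the previous char's right edge and this
        # char's left edge is treated as an implicit space.
        if prev is not None and char["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(char["text"])
        prev = char
    return "".join(pieces)


# Made-up chars for the words "Hi" and "you": there is no space glyph,
# only a wider gap between "i" and "y".
sample_chars = [
    {"text": "H", "x0": 10.0, "x1": 16.0},
    {"text": "i", "x0": 16.5, "x1": 19.0},
    {"text": "y", "x0": 25.0, "x1": 30.0},
    {"text": "o", "x0": 30.5, "x1": 36.0},
    {"text": "u", "x0": 36.5, "x1": 42.0},
]

print("".join(c["text"] for c in sample_chars))  # naive join of the chars
print(join_chars_with_spaces(sample_chars))      # join with inferred spaces
```

Joining the raw characters loses the word boundary, while the gap-based version recovers it. That is roughly why text built directly from .chars can look like it has no spaces.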

mon-hur commented 1 year ago

Hi, Thanks for your reply.

I used the chars property because I need each character's size, and extract_text doesn't return size data. How can I get size data together with the text that extract_text produces? Thanks.

jsvine commented 1 year ago

It would depend on your goal. What specifically are you aiming for?

mon-hur commented 1 year ago

I would like to check whether the font size of a line of text is bigger than that of the other lines. I need this to decide whether the line is a title or subtitle (which use a bigger size than normal text).

jsvine commented 1 year ago

Thanks, that's helpful context. I would suggest using .extract_words(extra_attrs=["size"]), which will return a list of word objects and will respect the implicit spacing. Because of the extra_attrs argument, each of those objects will also include the font size (and will break up words that have characters of differing sizes). Then, you can use the top or bottom property to determine which lines they're on.
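As a rough sketch of that approach: the helper names below (`group_words_into_lines`, `find_larger_lines`, `larger_lines_in_pdf`), the `ratio` threshold, and the sample data are all hypothetical and not part of pdfplumber. The only pdfplumber calls used are `pdfplumber.open(...)` and `.extract_words(extra_attrs=["size"])`. The sketch assumes that the median word size is a fair estimate of the body text size, which may not hold for every document.

```python
from statistics import median


def group_words_into_lines(words, y_tolerance=3):
    """Group word dicts into lines using their "top" coordinate."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Same line if this word's top is close to the line's first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines


def find_larger_lines(words, ratio=1.2, y_tolerance=3):
    """Return (text, size) for lines noticeably larger than body text.

    Assumption: body size is estimated as the median word size, and a
    line counts as "larger" if its biggest word is >= ratio * body size.
    """
    if not words:
        return []
    body_size = median(w["size"] for w in words)
    results = []
    for line in group_words_into_lines(words, y_tolerance):
        line_size = max(w["size"] for w in line)
        if line_size >= body_size * ratio:
            ordered = sorted(line, key=lambda w: w["x0"])
            results.append((" ".join(w["text"] for w in ordered), line_size))
    return results


def larger_lines_in_pdf(path, page_number=0):
    """Run the check on one page of a real PDF (requires pdfplumber)."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        words = pdf.pages[page_number].extract_words(extra_attrs=["size"])
    return find_larger_lines(words)


# Made-up words shaped like extract_words(extra_attrs=["size"]) output.
sample_words = [
    {"text": "Annual", "x0": 72.0, "top": 50.0, "size": 18.0},
    {"text": "Report", "x0": 136.0, "top": 50.0, "size": 18.0},
    {"text": "This", "x0": 72.0, "top": 80.0, "size": 10.0},
    {"text": "is", "x0": 95.0, "top": 80.0, "size": 10.0},
    {"text": "body", "x0": 107.0, "top": 80.0, "size": 10.0},
    {"text": "text", "x0": 72.0, "top": 94.0, "size": 10.0},
    {"text": "here.", "x0": 92.0, "top": 94.0, "size": 10.0},
]

print(find_larger_lines(sample_words))
```

For a real file, something like `larger_lines_in_pdf("your_file.pdf")` would be the entry point. The `ratio` and `y_tolerance` values are starting points to tune per document, not recommended defaults.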