metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0
131 stars 21 forks

improve space detection and remove pdfminer high level code #25

Open metebalci opened 3 years ago

metebalci commented 3 years ago

Text in a PDF file might not contain an explicit space character. Instead, a space may be indicated only by an additional horizontal offset between the glyphs before and after it, i.e. between the last character of one word and the first character of the next. pdfminer has high-level code that detects this, i.e. it inserts a space if the gap between characters is greater than a certain threshold (possibly specified in the font file). It would be better to do this manually, and also to insert a space when the vertical position changes (a title spanning more than one line). When this is done, I think the get_title_from_io method can be simplified by removing the TextConverter and PDFPageInterpreter related parts.
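The manual detection described above could look roughly like the sketch below. It is not pdftitle's or pdfminer's actual code: the `Glyph` record, the `join_glyphs` helper, and the `gap_ratio` and `line_tol` thresholds are all hypothetical, and a real implementation would take glyph positions from the PDF text state and might use font metrics for the threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its position on the page."""
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.25,
                line_tol: float = 1.0) -> str:
    """Join glyphs into text, inserting a space where the layout implies one.

    A space is inserted when the horizontal gap to the previous glyph exceeds
    gap_ratio times the previous glyph's width, or when the baseline moves by
    more than line_tol (the title continues on another line).
    Both thresholds are assumed values, not taken from any font file.
    """
    out: List[str] = []
    prev: Optional[Glyph] = None
    for g in glyphs:
        # Only consider inserting a space between two non-space characters.
        if prev is not None and not prev.char.isspace() and not g.char.isspace():
            prev_width = prev.x1 - prev.x0
            new_line = abs(g.y - prev.y) > line_tol
            wide_gap = (g.x0 - prev.x1) > gap_ratio * prev_width
            if new_line or wide_gap:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Two words on one line separated only by a gap, then a word on the next line.
example = [
    Glyph("H", 0, 5, 100), Glyph("i", 5, 10, 100),
    Glyph("y", 14, 19, 100), Glyph("o", 19, 24, 100),
    Glyph("A", 0, 5, 80),
]
print(join_glyphs(example))
```

Handling the vertical change in the same loop means multi-line titles get a separating space even when every line already contains explicit space characters.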

mdbraber commented 2 years ago

Is it currently expected that a headline spanning multiple lines is output in the wrong format (in my case, words on different lines are joined without spaces)?

mdbraber commented 2 years ago

Figured out what's happening in the current version. My PDF titles are split over multiple lines, but the lines themselves contain spaces, so the check on line 564 (if " " not in title) never evaluates to True and space correction is skipped. When I force it, it works in my case. Maybe add an argument to force space correction (or just always correct spaces)?
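A minimal sketch of the proposed change, assuming the correction step can be treated as a separate callable. The names `finalize_title`, `correct_spaces`, and `force` are hypothetical and do not come from pdftitle; only the `" " not in title` check is taken from the comment above.

```python
from typing import Callable


def finalize_title(title: str, correct_spaces: Callable[[str], str],
                   force: bool = False) -> str:
    """Apply space correction when the title has no spaces, or when forced.

    Without force, a multi-line title whose lines already contain spaces
    skips correction, which is the failure described above.
    """
    if force or " " not in title:
        return correct_spaces(title)
    return title


# Stand-in corrector for illustration only: it returns a fixed, corrected string.
def fake_correct(_: str) -> str:
    return "Deep Learning for PDF Titles"


# Two lines joined without a space at the line break ("for" + "PDF").
joined = "Deep Learning forPDF Titles"
print(finalize_title(joined, fake_correct))              # unchanged, check fails
print(finalize_title(joined, fake_correct, force=True))  # corrected
```

Keeping `force` off by default preserves the current behaviour for existing users, while always correcting would amount to dropping the condition entirely.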

metebalci commented 2 years ago

Yes, it makes sense to add an argument. If possible, can you share the PDF so it can be used to validate this improvement?


mdbraber commented 2 years ago

I've sent the PDF files via e-mail for validation. Forcing space correction now works on some articles, but not yet on all of them.