radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine in order to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

Fix for matching fonts, incorrectly returns "Arial" instead of "Arial Narrow" #19

Closed clint-journaltech closed 7 years ago

clint-journaltech commented 7 years ago
pdf2dom-fonts-test
m-abboud commented 7 years ago

Ah, it's been a minute since I last looked at this project, but I suspect this PR might have a problem.

I think the full font names are being passed to the findKnownFontFamily method, so they come in as "Arial+Regular" or something like that. That will break with this changeset, since the method now uses equals.

To fix it, I think you need to pass just the family name to findKnownFontFamily instead of the full one (and maybe rename it to isKnownFontFamily and restructure the logic so it returns a boolean, but I digress).
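A minimal sketch of that suggestion, not the project's actual code: it assumes the full name arrives as a family and a style joined by '+', as in the "Arial+Regular" example above. The class name, the familyOf helper, and the list of known families are hypothetical and exist only for illustration.

```java
// Hypothetical sketch: these names are illustrative, not Pdf2Dom's real code.
final class FamilyNameSketch {

    // Assumption: full names look like "<family>+<style>", e.g. "Arial+Regular".
    // Returns the part before the first '+', or the whole name if there is none.
    static String familyOf(String fullName) {
        int plus = fullName.indexOf('+');
        return plus < 0 ? fullName : fullName.substring(0, plus);
    }

    // Exact, case-insensitive comparison against each known family.
    static boolean isKnownFontFamily(String family, String[] known) {
        for (String k : known) {
            if (k.equalsIgnoreCase(family)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] known = {"Arial", "Times", "Courier"}; // illustrative values
        // The full name does not equal any known family.
        System.out.println(isKnownFontFamily("Arial+Regular", known));
        // The family part alone does.
        System.out.println(isKnownFontFamily(familyOf("Arial+Regular"), known));
    }
}
```

With an exact comparison, the first call reports no match and the second one matches, which is why the family name has to be separated from the style before the lookup.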

Check if the text.getFont() object up at the top has some sort of getFamily method, or look in the FontTable class where I added a font family regex matcher.

Dunno if using the findKnownFontFamily method to reduce output size is a good idea at all though, as I'm sure some PDFs have Arial fonts that differ from the standard one, and replacing those will make the final document look way different...

clint-journaltech commented 7 years ago

My bad, I was manipulating the values sent to findKnownFontFamily, and missed the original '+' joiner.

I couldn't see how to get the family from the text.getFont() object, or a way to suitably use the FontTable font family regex matcher. The problem I found was being able to match "Arial" with any suffix value (like "Arial MT") whilst also being able to catch "Arial Narrow" (similarly for "Times" and "Times New Roman PSMT"). The most straightforward solution I could figure out was to keep the values in the cssFontFamily array in order of precedence, so that it checks against "Arial Narrow" before looking at "Arial".
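The ordering idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the array contents, the prefix test, and the null return for unknown fonts are all hypothetical choices made for the example.

```java
import java.util.Locale;

// Hypothetical sketch of precedence-ordered matching; not the PR's actual code.
final class FontPrecedenceSketch {

    // Assumption: illustrative contents. More specific families are listed
    // before the shorter names they start with, so they are tested first.
    static final String[] CSS_FONT_FAMILY = {
        "Arial Narrow", "Arial",
        "Times New Roman", "Times",
        "Courier New", "Courier"
    };

    // Returns the first listed family that the font name starts with
    // (case-insensitive), or null if no listed family matches.
    static String findKnownFontFamily(String fontName) {
        String lower = fontName.toLowerCase(Locale.ROOT);
        for (String family : CSS_FONT_FAMILY) {
            if (lower.startsWith(family.toLowerCase(Locale.ROOT))) {
                return family;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findKnownFontFamily("Arial MT"));
        System.out.println(findKnownFontFamily("Arial Narrow"));
        System.out.println(findKnownFontFamily("Times New Roman PSMT"));
    }
}
```

If "Arial" were listed before "Arial Narrow", the prefix test would stop at "Arial" for both names, which is the behaviour described in the issue title.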

clint-journaltech commented 7 years ago

Any chance of getting this merged or further comments please? We'd prefer to use this directly rather than forking our own.

radkovo commented 7 years ago

Thanks for your contribution, and I am sorry for the delay. I consider this way of font mapping a temporary solution, but I think your improvement is reasonable until we find a more general way of mapping (well, embedding the fonts is the most general solution, but it may have some drawbacks too). I have only one comment before I merge: could you please remove the pom.xml change? Until we publish the 1.7 release, we should not start with 1.8. Thanks for your contribution! @m-abboud please feel free to comment as well, as you are probably more familiar with the font issues.

clint-journaltech commented 7 years ago

No worries, thanks! Of note, we have had some problems with embedding the fonts: the alignment and spacing have been off. The pom.xml change has been made.

m-abboud commented 7 years ago

looks fine