radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine in order to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

embedded fonts, Bare CFF and TTF formats #8

Closed: m-abboud closed 8 years ago

m-abboud commented 8 years ago

Discussed in issue #7.

This pull request adds handling for bare CFF fonts, using a separate font conversion lib, and for TTF fonts, which browsers can just use as is (which is what you did originally in the branch). Other font types are still not supported.
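To illustrate what "used as is by browsers" can look like in the generated HTML, here is a minimal sketch that embeds raw TrueType bytes in a CSS `@font-face` rule as a base64 data URI. The class and method names (`FontFaceSketch`, `fontFaceRule`) are hypothetical and this is not necessarily how Pdf2Dom writes its fonts; it only shows the general technique.

```java
import java.util.Base64;

class FontFaceSketch {
    /**
     * Builds a CSS @font-face rule that embeds the given font bytes as a
     * base64 data URI. The bytes are passed through unchanged, which only
     * works if the browser accepts the font as it is.
     */
    static String fontFaceRule(String family, byte[] ttfBytes) {
        String b64 = Base64.getEncoder().encodeToString(ttfBytes);
        return "@font-face { font-family: \"" + family + "\"; "
             + "src: url(\"data:font/ttf;base64," + b64 + "\") format(\"truetype\"); }";
    }

    public static void main(String[] args) {
        // Placeholder bytes only; a real caller would pass the embedded font stream.
        byte[] placeholder = {0, 1, 0, 0};
        System.out.println(fontFaceRule("EmbeddedFont", placeholder));
    }
}
```

A bare CFF font cannot be embedded this way directly, which is why it needs a conversion step first.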

radkovo commented 8 years ago

Great, thank you! BTW, do you have a test PDF containing a CFF font? I have tested several random files, but all of them contain Type 1 or TrueType fonts.

radkovo commented 8 years ago

With the TrueType fonts, both Chrome and Firefox complain that they cannot load the embedded font: "OTS parsing error: OS/2: missing required table". This is what I already ran into during my experiments. Do you have any idea how to solve this?

m-abboud commented 8 years ago

Yeah see bare-cff.pdf in the commit

For TrueType, it looks like strict validation by the browsers. I don't think the OS/2 table is required in the TTF spec, but it should be simple to just normalize TTF fonts and jam in an OS/2 table if one's missing. I'll do it in a sec.
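Before normalizing, it helps to know which tables a font actually lacks. The following is a rough sketch, assuming the font starts with a standard sfnt header (12 bytes) followed by 16-byte table records; it reads the table tags and reports which of a caller-supplied list are missing. The names `SfntTableCheck`, `tableTags`, `missingTables` and `directoryOnly` are made up for illustration and are not part of Pdf2Dom or FontVerter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SfntTableCheck {
    /** Reads the 4-character table tags from the sfnt table directory. */
    static Set<String> tableTags(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font);   // big-endian by default, like sfnt data
        buf.getInt();                             // sfntVersion
        int numTables = buf.getShort() & 0xFFFF;  // unsigned 16-bit table count
        buf.position(12);                         // skip searchRange, entrySelector, rangeShift
        Set<String> tags = new LinkedHashSet<>();
        for (int i = 0; i < numTables; i++) {
            byte[] tag = new byte[4];
            buf.get(tag);
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);    // skip checksum, offset, length
        }
        return tags;
    }

    /** Returns the entries of 'required' that are absent from the font's table directory. */
    static List<String> missingTables(byte[] font, String... required) {
        Set<String> present = tableTags(font);
        List<String> missing = new ArrayList<>();
        for (String tag : required) {
            if (!present.contains(tag)) {
                missing.add(tag);
            }
        }
        return missing;
    }

    /** Demo helper: builds a header and table directory only, with no table data. */
    static byte[] directoryOnly(String... tags) {
        ByteBuffer buf = ByteBuffer.allocate(12 + 16 * tags.length);
        buf.putInt(0x00010000);                   // TrueType sfntVersion
        buf.putShort((short) tags.length);
        buf.position(12);
        for (String tag : tags) {
            buf.put(tag.getBytes(StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);    // leave checksum, offset, length zeroed
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] font = directoryOnly("cmap", "glyf", "head");
        // The list of tables to check is the caller's choice, not an authoritative set.
        System.out.println(missingTables(font, "OS/2", "cmap", "head"));
    }
}
```

A check like this only says that a table is absent; building a valid replacement table is a separate and larger job.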

m-abboud commented 8 years ago

Dump of a bunch of car manual pdfs here: https://drive.google.com/open?id=0B-4Jtn2YkMBxZ2dEcHFGR0dZQlU

And this PDF has like 8 CFF fonts: https://drive.google.com/open?id=0B-4Jtn2YkMBxeHFFTFJUcjZjUHc (although 2 don't work because of a bug in the FontVerter version Pdf2Dom is on; I've already fixed it though, just need to switch it to snapshots).

I bumped into ~6 TrueType fonts in the PDFs I ran and only got unusual cmap table errors in 2; the others were fine. Can you send me one of your PDFs with the OS/2 table error?

m-abboud commented 8 years ago

Added OS/2 table normalization in this branch: https://github.com/m-abboud/Pdf2Dom/tree/normalize-ttfs

Though fixing that issue might just reveal more validation issues with the TTF

radkovo commented 8 years ago

Hi, thanks for the conversion. I gave it a try and was able to convert the following file: BP.pdf. However, I get a strange message from Firefox: "downloadable font: not usable by platform". I also tried two others, brno30.pdf and HorariosMadrid_Segovia.pdf, but I get a null pointer exception somewhere inside FontVerter (in a different place for each one). brno30.pdf is especially interesting because it uses a symbol font.

m-abboud commented 8 years ago

Ah, so the problem is that they have Type 0 fonts with a TTF descendant, which means the TTF font is usually incomplete and missing a few tables that have to be created from the parent Type 0 font data.

So I've got that kinda sorta working in latest version here https://github.com/m-abboud/Pdf2Dom/tree/normalize-ttfs

In HorariosMadrid_Segovia.pdf, only 2 fonts fail validation in Firefox and Chrome.

brno30.pdf was all working, but I broke something somewhere and now 1 font is failing validation. Also, the symbol fonts are usually bare CFF, which FontVerter handles well.

I think I know what the problem is for the 2 fonts in HorariosMadrid_Segovia.pdf, but I'm feeling a little burned out on fonts now lol.