radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine in order to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

embedfonts branch remaining work? #7

Closed by m-abboud 8 years ago

m-abboud commented 8 years ago

(More of a discussion item than an actual issue)

So what work remains for the embedfonts branch? From what I can see, it just needs conversion added for the various PDF font types that browsers don't support?

I added support for bare CFF (which PDFBox calls Type1C, I believe) in my embedded-fonts-2 branch, and I may start work on the others. If I recall correctly, bare CFF is the most common type, but that may be wishful thinking.
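
To illustrate the distinction: a bare CFF font program has no OpenType (sfnt) wrapper, which is why browsers reject it as extracted. Below is a minimal sketch, using only the JDK, that guesses the format of an extracted font program from its leading bytes. The class `FontSniffer` and its `sniff` method are hypothetical names for illustration and are not part of Pdf2Dom or PDFBox. The byte checks are a rough heuristic, not a validating parser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Pdf2Dom or PDFBox): guesses the format
// of an embedded font program from its first four bytes.
final class FontSniffer {

    static String sniff(byte[] data) {
        if (data == null || data.length < 4) {
            return "unknown";
        }
        String tag = new String(data, 0, 4, StandardCharsets.ISO_8859_1);
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        int b2 = data[2] & 0xFF;
        int b3 = data[3] & 0xFF;

        // sfnt-wrapped formats that browsers generally accept
        if (tag.equals("OTTO")) {
            return "OpenType/CFF";
        }
        if (tag.equals("wOFF")) {
            return "WOFF";
        }
        if (tag.equals("true") || (b0 == 0 && b1 == 1 && b2 == 0 && b3 == 0)) {
            return "TrueType";
        }
        // Bare CFF header: major = 1, minor = 0, hdrSize, offSize in 1..4
        if (b0 == 1 && b1 == 0 && b3 >= 1 && b3 <= 4) {
            return "bare CFF";
        }
        // Type 1 program in PostScript text form
        if (b0 == '%' && b1 == '!') {
            return "Type 1";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = {1, 0, 4, 2}; // looks like a CFF header
        System.out.println(sniff(sample));
    }
}
```

Anything classified as bare CFF or Type 1 would need conversion (for example, wrapping the CFF data in an OpenType container) before a browser will load it.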

radkovo commented 8 years ago

This looks very nice. I ended up unable to extract a font file that browsers would accept. Obviously some conversion is necessary, but since I am not an expert on fonts, I gave up for the time being. If your conversion works, I'll be pleased to merge your branch.

radkovo commented 8 years ago

Resolved by #8 and #10.