radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts PDF documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0
175 stars 71 forks

PDFDomTree font extract options #9

Closed m-abboud closed 8 years ago

m-abboud commented 8 years ago

Adds font extraction modes to PDFDomTree: embedding fonts in the output, saving fonts to disk, and ignoring PDF fonts completely.

Usage:

// fontDir is the directory where extracted fonts are saved
PDFDomTreeConfig config = PDFDomTreeConfig.createDefaultConfig();
config.setFontExtractDirectory(fontDir);
config.setFontMode(SAVE_TO_DIR);

PDFDomTree parser = new PDFDomTree(config);

Modes are: EMBED_BASE64 (embed fonts in the output), SAVE_TO_DIR (save fonts to disk), IGNORE_FONTS (ignore PDF fonts completely)
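To make the difference between the three modes concrete, here is a minimal standalone sketch of what each one might produce for a single font. It is not code from this pull request and does not use Pdf2Dom: the class `FontModeSketch`, the method `fontSrc`, the `.ttf` file extension and the `font/ttf` MIME type are all assumptions made for illustration. Only the three mode names come from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Optional;

/** Hypothetical illustration of the three font modes; not part of Pdf2Dom. */
public class FontModeSketch {

    /** Mode names taken from the pull request description. */
    public enum FontMode { EMBED_BASE64, SAVE_TO_DIR, IGNORE_FONTS }

    /**
     * Returns a CSS url(...) value that a generated @font-face rule could use,
     * or an empty Optional when fonts are ignored.
     * Assumption: fonts are TrueType and are stored with a .ttf extension.
     */
    public static Optional<String> fontSrc(FontMode mode, String fontName,
                                           byte[] fontData, Path fontDir) {
        switch (mode) {
            case EMBED_BASE64:
                // Inline the font bytes into the document as a data URI.
                String encoded = Base64.getEncoder().encodeToString(fontData);
                return Optional.of("url('data:font/ttf;base64," + encoded + "')");
            case SAVE_TO_DIR:
                // Write the font to disk and reference the file by its URI.
                try {
                    Files.createDirectories(fontDir);
                    Path file = fontDir.resolve(fontName + ".ttf");
                    Files.write(file, fontData);
                    return Optional.of("url('" + file.toUri() + "')");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            default:
                // IGNORE_FONTS: emit nothing, so the viewer uses fallback fonts.
                return Optional.empty();
        }
    }
}
```

In this sketch, EMBED_BASE64 keeps the output self-contained at the cost of a larger HTML file, SAVE_TO_DIR keeps the HTML small but requires the font directory to travel with it, and IGNORE_FONTS produces no font reference at all.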