radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

Resulting HTML does not include PDF form fields #17

Open jezerinac opened 8 years ago

jezerinac commented 8 years ago

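Code:

```java
import java.io.PrintWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.fit.pdfdom.PDFDomTreeConfig;

// Embed fonts and images in the generated HTML as Base64 data.
PDFDomTreeConfig settings = PDFDomTreeConfig.createDefaultConfig();
settings.setFontHandler(PDFDomTreeConfig.embedAsBase64());
settings.setImageHandler(PDFDomTreeConfig.embedAsBase64());

PDFDomTree pdfDomTree = new PDFDomTree(settings);
// inputStream and outputPath are defined elsewhere.
try (PDDocument pdf = PDDocument.load(inputStream)) {
    try (PrintWriter output = new PrintWriter(outputPath.toFile(), "utf-8")) {
        pdfDomTree.writeText(pdf, output);
    }
}
```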

Input: sample-form.pdf

Output: the generated HTML does not include any form fields
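
For anyone trying to reproduce this, one rough way to confirm the symptom is to count form controls in the generated HTML. The sketch below is only an illustration: `FormControlCounter` is a hypothetical helper, not part of Pdf2Dom, and it uses a regular expression rather than a real HTML parser, so it can be fooled by tag-like text inside comments or scripts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pdf2Dom): counts HTML form controls in a string.
final class FormControlCounter {
    // Matches the opening tag of common form controls, case-insensitively.
    private static final Pattern CONTROL =
            Pattern.compile("<(input|select|textarea|button)\\b", Pattern.CASE_INSENSITIVE);

    // Returns how many form-control opening tags occur in the given HTML text.
    static int count(String html) {
        Matcher m = CONTROL.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; real output would be read from the generated file.
        String withoutFields = "<div class=\"p\">Name:</div>";
        String withField = "<div class=\"p\">Name: <input type=\"text\" name=\"name\"></div>";
        System.out.println(count(withoutFields));
        System.out.println(count(withField));
    }
}
```

A count of zero for a PDF that is known to contain form fields would match the behaviour reported here.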

m-abboud commented 7 years ago

I'm starting work on this today (if anyone was asking)

radkovo commented 7 years ago

That's great news! I wouldn't have time to look at this in the near future. Many thanks!

AdeshAtole commented 6 years ago

@m-abboud Is this open to work on? @radkovo

radkovo commented 6 years ago

Yes, go ahead (if there are no comments from @m-abboud)

AdeshAtole commented 6 years ago

@radkovo Any contributing guidelines?