radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

Getting java.lang.UnsupportedOperationException at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB #31

Open aino-vedang opened 5 years ago

aino-vedang commented 5 years ago

I am using Pdf2Dom to parse a PDF document. When I try to convert a PDF file to HTML in my Java application, I get the following exception:

java.lang.UnsupportedOperationException
    at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB(PDPattern.java:95)
    at org.fit.pdfdom.PathDrawer.pdfColorToColor(PathDrawer.java:133)
    at org.fit.pdfdom.PathDrawer.clearPathGraphics(PathDrawer.java:79)
    at org.fit.pdfdom.PathDrawer.drawPath(PathDrawer.java:59)
    at org.fit.pdfdom.PDFDomTree.createPathImage(PDFDomTree.java:403)
    at org.fit.pdfdom.PDFDomTree.renderPath(PDFDomTree.java:251)
    at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:499)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showForm(PDFStreamEngine.java:181)
    at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:65)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
    at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:542)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:208)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at com.demo.pdf.converter.PdfProcessor.convertToHtml(PdfProcessor.java:87)

THausherr commented 5 years ago

I suspect this is a bug in Pdf2Dom: a pattern in a PDF can't be converted to an RGB color. (Think about it: a dot pattern, for example, isn't a single RGB color but a set of vector graphics instructions.)

To see how patterns are treated in PDFBox, see PageDrawer.getPaint().

More files with patterns can be found here: https://issues.apache.org/jira/browse/PDFBOX-1094
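
To illustrate the point about patterns, here is a minimal sketch of the kind of guard a caller could place in front of the RGB conversion. It is not Pdf2Dom's actual code or fix. The types and names (`ColorSpace`, `DeviceRgb`, `Pattern`, `toPackedRgb`) are hypothetical stand-ins, modelled loosely on PDFBox's `PDColorSpace` and `PDPattern`, so that the example runs without PDFBox on the classpath. The gray fallback color is also an assumption; rendering the pattern properly is a different job.

```java
public class PatternColorGuard {

    /** Hypothetical stand-in for a PDF color space (not the PDFBox type). */
    interface ColorSpace {
        float[] toRGB(float[] components);
    }

    /** Hypothetical device RGB space: the components already are RGB in [0, 1]. */
    static final class DeviceRgb implements ColorSpace {
        @Override
        public float[] toRGB(float[] c) {
            return new float[] {c[0], c[1], c[2]};
        }
    }

    /** Hypothetical pattern space: it has no single RGB value, so it throws as in the stack trace. */
    static final class Pattern implements ColorSpace {
        @Override
        public float[] toRGB(float[] c) {
            throw new UnsupportedOperationException("a pattern has no single RGB value");
        }
    }

    /**
     * Converts a color to packed 0xRRGGBB. Pattern color spaces are detected
     * up front and mapped to the caller-supplied fallback instead of throwing.
     */
    static int toPackedRgb(ColorSpace space, float[] components, int fallbackRgb) {
        if (space instanceof Pattern) {
            return fallbackRgb; // assumption: a flat fallback is acceptable to the caller
        }
        float[] rgb = space.toRGB(components);
        return (channel(rgb[0]) << 16) | (channel(rgb[1]) << 8) | channel(rgb[2]);
    }

    /** Clamps a [0, 1] component and scales it to 0..255. */
    private static int channel(float v) {
        return Math.round(Math.max(0f, Math.min(1f, v)) * 255f);
    }

    public static void main(String[] args) {
        int gray = 0x808080; // assumed fallback color
        int red = toPackedRgb(new DeviceRgb(), new float[] {1f, 0f, 0f}, gray);
        int pattern = toPackedRgb(new Pattern(), new float[0], gray);
        System.out.printf("%06X %06X%n", red, pattern);
    }
}
```

Checking the color space type before converting avoids relying on the exception for control flow. A flat fallback loses the pattern's appearance, so it is only a stopgap compared with painting the pattern itself.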

aino-gautam commented 5 years ago

@aino-vedang Pdf2Dom uses PDFBox internally, not the other way around as you mentioned. Have you found a solution yet?

It also seems @THausherr is correct that the issue lies in Pdf2Dom.

aino-vedang commented 5 years ago

@aino-gautam I realised that the issue is in the Pdf2Dom library and have just updated the question. I haven't found a solution for it yet.