radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

Infinite loop in PDFBoxTree.processFontResources() #36

Closed by WPCleaner 5 years ago

WPCleaner commented 5 years ago

Hello,

For one particular PDF file, PDFBoxTree.processFontResources() recurses infinitely and ends in a StackOverflowError. I have several dozen PDF files that I want to convert, but only this one triggers the problem. Unfortunately, I can't share the PDF that causes it because it is confidential.

It happens with the latest release, 1.7.

The stack trace I get is the following:

java.lang.StackOverflowError
        at java.util.zip.Inflater.<init>(Inflater.java:102)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:74)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
        at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:236)
        at org.apache.pdfbox.pdmodel.common.PDStream.toByteArray(PDStream.java:505)
        at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:176)
        at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
        at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
        at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        ...

The code I use is the following:

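        // Imports needed by the snippet below (the snippet itself sits inside a method)
        import java.io.File;
        import javax.xml.transform.*;
        import javax.xml.transform.dom.DOMSource;
        import javax.xml.transform.stream.StreamResult;
        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.fit.pdfdom.PDFDomTree;
        import org.w3c.dom.Document;

        File pdfFile = new File("test.pdf");
        File outFile = new File("test.html");
        try (PDDocument pdf = PDDocument.load(pdfFile)) {
            PDFDomTree parser = new PDFDomTree();
            Document dom = parser.createDOM(pdf);
            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer trans = transFactory.newTransformer();
            Source src = new DOMSource(dom);
            Result dest = new StreamResult(outFile);
            trans.transform(src, dest);
        }

radkovo commented 5 years ago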

Hmm, strange. Could you test with the current master version? Just guessing, but the following change in Pdf2Dom might help:

In PDFBoxTree.java at line 402, try changing it from

if (formResources != null)

to

if (formResources != null && formResources != resources)

This is just a guess, so I don't want to commit it to the repo, but it's worth trying.

WPCleaner commented 5 years ago

As I'm using a class derived from PDFBoxTree for my purpose (trying to convert PDF into something usable with Angular, not directly HTML), I've done the following tests:

Copy/paste PDFBoxTree.updateFontTable() and PDFBoxTree.processFontResources() into my own class: obviously, the problem is still present, but it now occurs in processFontResources() in my class.

Modify processFontResources() in my class according to your suggestion: same problem, I still end up with a StackOverflowError.

I then tried to add a third parameter to processFontResources(), a Set<PDResources>, to keep in memory which resources have been processed and exit processFontResources() if the current resource has already been processed: same problem, I still end up with a StackOverflowError.

I then searched the source code and realized that PDFormXObject.getResources() creates a new PDResources each time it is called, so my Set<> won't be able to detect circular references. So I modified the Set to be a Set<COSDictionary> and checked on resources.getCOSObject(): it worked!
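To illustrate why the first Set could not detect the cycle, here is a minimal sketch that does not use PDFBox at all. The classes Dict, Res and Form are hypothetical stand-ins for COSDictionary, PDResources and the form XObject, and the sketch assumes, as described above, that the wrapper is rebuilt on every getResources() call and does not override equals().

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

class WrapperIdentityDemo {
    /** Hypothetical stand-in for COSDictionary: the shared low-level object. */
    static final class Dict { }

    /** Hypothetical stand-in for PDResources: a thin wrapper around a Dict. */
    static final class Res {
        private final Dict dict;
        Res(Dict dict) { this.dict = dict; }
        Dict getCOSObject() { return dict; }
    }

    /** Hypothetical stand-in for a form XObject: builds a new wrapper per call. */
    static final class Form {
        private final Dict resourcesDict;
        Form(Dict resourcesDict) { this.resourcesDict = resourcesDict; }
        Res getResources() { return new Res(resourcesDict); }
    }

    /** Tracks wrappers: the second lookup is a different object, so no repeat is seen. */
    static boolean repeatDetectedByWrapper(Form form) {
        Set<Res> seen = new HashSet<>();
        seen.add(form.getResources());
        return !seen.add(form.getResources());
    }

    /** Tracks the underlying dictionary by identity: the repeat is detected. */
    static boolean repeatDetectedByDictionary(Form form) {
        Set<Dict> seen = Collections.newSetFromMap(new IdentityHashMap<Dict, Boolean>());
        seen.add(form.getResources().getCOSObject());
        return !seen.add(form.getResources().getCOSObject());
    }
}
```

With a single Form, repeatDetectedByWrapper returns false while repeatDetectedByDictionary returns true, which matches the behaviour reported above for Set<PDResources> versus Set<COSDictionary>.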

So I finally tried your suggestion again, this time taking into account that the actual PDResources object will be different: it also worked!

So replacing line 402 in PDFBoxTree.java with the following line should work: if (formResources != null && formResources != resources && formResources.getCOSObject() != resources.getCOSObject())
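For comparison, the visited-set variant mentioned earlier in this comment can be sketched as follows. This is not the Pdf2Dom code: Dict is a hypothetical stand-in for a resources dictionary, and collectFonts and walk are illustrative names. The one-line check compares a form's resources only with the resources currently being processed, whereas an identity-based visited set would presumably also stop longer cycles (A refers to B, B refers back to A).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class CycleSafeWalkDemo {
    /** Hypothetical stand-in for a resources dictionary (COSDictionary in PDFBox). */
    static final class Dict {
        final List<String> fonts = new ArrayList<>();
        final List<Dict> formResources = new ArrayList<>();
    }

    /** Collects font names reachable from root, visiting each dictionary once. */
    static List<String> collectFonts(Dict root) {
        List<String> out = new ArrayList<>();
        Set<Dict> visited = Collections.newSetFromMap(new IdentityHashMap<Dict, Boolean>());
        walk(root, visited, out);
        return out;
    }

    private static void walk(Dict resources, Set<Dict> visited, List<String> out) {
        // Set.add returns false if this exact dictionary was already processed.
        if (resources == null || !visited.add(resources)) {
            return;
        }
        out.addAll(resources.fonts);
        for (Dict child : resources.formResources) {
            walk(child, visited, out);
        }
    }
}
```

The trade-off is an extra parameter threaded through the recursion, in exchange for termination on any cycle shape rather than only a direct self-reference.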

radkovo commented 5 years ago

Great, thanks for proposing the solution. It seems reasonable; I have committed it to master.