radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts the documents to a HTML DOM representation. The obtained DOM tree may be then serialized to a HTML file or further processed. A command-line utility for converting the PDF documents to HTML is included in the distribution package. Pdf2Dom may be also used as an independent Java library with a standard DOM interface for your DOM-based applications or as an alternative parser for the CSSBox rendering engine in order to add the PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

infinite loop issue in the font handling #27

Closed fisch92 closed 3 years ago

fisch92 commented 6 years ago

Hi, we think we found an infinite loop issue in the font handling of Pdf2Dom 1.7. The stack trace is:

    org.fit.pdfdom.FontTable.nextUsedName(FontTable.java:83)
    org.fit.pdfdom.FontTable.addEntry(FontTable.java:45)
    org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
    org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
    org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
    org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    ...

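Looking at the code, we see that the index variable i is never incremented. The candidate name therefore stays at fontName + 1 on every iteration, so as soon as a third font with the same name is added (when both fontName and fontName + 1 are already taken) the loop never terminates.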

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
        usedName = fontName + i;

    return usedName;

}

We propose the following fix:

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
    {
        usedName = fontName + i;
        i++;
    }

    return usedName;
}
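As a minimal, self-contained sketch of why the increment matters, the fixed loop can be exercised against a plain set of used names. FontNameRegistry, register and registerAll are hypothetical stand-ins written for this illustration. They are not part of Pdf2Dom, and the real FontTable may track names differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the name bookkeeping in FontTable; not Pdf2Dom code.
class FontNameRegistry {
    private final Set<String> usedNames = new HashSet<>();

    boolean isNameUsed(String name) {
        return usedNames.contains(name);
    }

    // Same loop as the proposed fix: the suffix grows on every collision,
    // so each iteration tests a candidate that has not been tried before.
    String nextUsedName(String fontName) {
        int i = 1;
        String usedName = fontName;
        while (isNameUsed(usedName)) {
            usedName = fontName + i;
            i++;
        }
        return usedName;
    }

    // Registers one font and returns the unique name assigned to it.
    String register(String fontName) {
        String unique = nextUsedName(fontName);
        usedNames.add(unique);
        return unique;
    }

    // Registers every name in order and returns the assigned names.
    static List<String> registerAll(List<String> fontNames) {
        FontNameRegistry registry = new FontNameRegistry();
        List<String> assigned = new ArrayList<>();
        for (String name : fontNames) {
            assigned.add(registry.register(name));
        }
        return assigned;
    }
}
```

With this sketch, registering the same name three times yields the name itself, then the name with suffix 1, then the name with suffix 2. Without the i++, the third registration would keep testing the suffix-1 candidate forever.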

You can reproduce it with, for example, this PDF file: http://regalwerk.de/fileadmin/user_upload/RW_Katalog_2018_2019_72DPI.pdf

Page 115 contains the following fonts, which cause the algorithm to hang:

    VRXWUQ+Verdana
    VRXWUQ+Verdana
    VRXWUQ+Futura-Bold
    VRXWUQ+Futura-Book
    VRXWUQ+Verdana-Bold
    VRXWUQ+Verdana

Until the issue is fixed in Pdf2Dom, we will use a workaround that counts the fonts present on each page and skips the problematic pages.
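One way such a check could look is sketched below. It is an assumption about the workaround's shape, not the reporter's actual code. FontNameGuard, countNames and wouldHang are hypothetical names, and the list of font names for a page is taken as given rather than read through PDFBox. The threshold of three comes from the analysis above, where the unfixed loop hangs on the third font with the same name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-check for one page; not part of Pdf2Dom or PDFBox.
class FontNameGuard {
    // Counts how often each font name occurs in the given list.
    static Map<String, Integer> countNames(List<String> fontNames) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : fontNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // True when some name occurs three or more times, which is the case
    // the issue reports as hanging the unfixed nextUsedName loop.
    static boolean wouldHang(List<String> fontNames) {
        for (int count : countNames(fontNames).values()) {
            if (count >= 3) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would skip conversion for any page where wouldHang returns true. For the page 115 list above it does, because VRXWUQ+Verdana appears three times.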

Have a nice day

m-abboud commented 6 years ago

Hello, thanks for the report! Sounds like a high severity bug so I'll look at this one and the one you made in FontVerter when I get home from work in 5 or so hours.

If you want another workaround there's an option somewhere in some config class you can pass in to disable font processing entirely. (Been a minute since I worked on this so I forget the name)

fisch92 commented 6 years ago

> If you want another workaround there's an option somewhere in some config class you can pass in to disable font processing entirely. (Been a minute since I worked on this so I forget the name)

Thank you, the workaround works for me.

Have a nice day

1289naveen commented 6 years ago

Has this issue been fixed, and has an updated release been published?