tabulapdf / tabula-java

Extract tables from PDF files
MIT License
1.85k stars 430 forks source link

ObjectExtractorStreamEngine reads beyond the end of stroke paths #452

Open gamorris opened 3 years ago

gamorris commented 3 years ago

ObjectExtractorStreamEngine.java contains this code

        while (!pathIterator.isDone()) {
            pathIterator.next();
            // This can be the last segment, when pathIterator.isDone, but we need to
            // process it otherwise us-017.pdf fails the last value.
            try {
                currentSegment = pathIterator.currentSegment(coordinates);
            } catch (IndexOutOfBoundsException ex) {
                continue;
            }

I was looking at doing a little refactoring to that code[^1] and got curious about what was happening there. Changing the code to not read past pathIterator.isDone() does in fact cause the unit test TestSpreadsheetExtractor.testNaturalOrderOfRectanglesDoesNotBreakContract (which reads us-017.pdf) to fail. The failure suggests it is not reading the bottom right table cell.

I added some System.err.println debugging at the top of that while loop to see what it is doing. The debugging for the first line segment in us-017.pdf looks like:

segment: SEG_LINETO; done? false startPoint=Point2D.Float[91.75, 694.92] endPoint=null last_move=Point2D.Float[91.75, 694.92] coordinates=Point2D.Float[323.17, 694.92]
segment: SEG_LINETO; done? true startPoint=Point2D.Float[323.17, 694.92] endPoint=Point2D.Float[323.17, 694.92] last_move=Point2D.Float[91.75, 694.92] coordinates=Point2D.Float[323.17, 462.2]

From loading up us-017.pdf in the pdfbox debugger, if I am not mistaken that comes from this chunk of pdf:

  q
    1.1040955 0 0 -1 91.74725 352 cm
    0.153 0.145 0.145 RG
    1 w
    10 M
    0 j
    0 J
    [ ] 0 d
    /GS2 gs
    0 254.924 m
    209.606 254.924 l
    S
  Q

There is in fact only 1 line to segment in that pdf content (the l operator at the bottom). Of the 73 line segments extracted from that page, for every single one of them the coordinates pulled from pathIterator.currentSegment(_) when pathIterator.isDone() == true are that same point: Point2D.Float[323.17, 462.2]

Where does that coordinate come from? I think it's the second corner in the rectangles drawn (filled) as the background for the header row in the table on that page (page 2 of us-017.pdf).

So...it looks to me like that loop is reading coordinates that it shouldn't (conceptually: it's reading uninitialized memory), and thus also generating lines that don't actually exist in the PDF. However, the comment in the code is correct: fixing the loop apparently breaks the extraction for the us-017.pdf file. I think that means there is a bug elsewhere in the extraction machinery that the extra loop iteration is hiding.

[^1]: I have a pdf where it seems table extraction would be improved by tracking shaded regions, so I was looking to add support for filled rectangles.

gamorris commented 3 years ago

It looks like the change to Utils.java proposed in #242 fixes the problem with us-017.pdf that over-reading the path iterator was intended to fix.