ObjectExtractorStreamEngine.java contains this code
while (!pathIterator.isDone()) {
pathIterator.next();
// This can be the last segment, when pathIterator.isDone, but we need to
// process it otherwise us-017.pdf fails the last value.
try {
currentSegment = pathIterator.currentSegment(coordinates);
} catch (IndexOutOfBoundsException ex) {
continue;
}
I was looking at doing a little refactoring to that code[^1] and got curious about what was happening there. Changing the code to not read past pathIterator.isDone() does in fact cause the unit test TestSpreadsheetExtractor.testNaturalOrderOfRectanglesDoesNotBreakContract (which reads us-017.pdf) to fail. The failure suggests it is not reading the bottom right table cell.
I added some System.err.println debugging at the top of that while loop to see what it is doing. The debugging for the first line segment in us-017.pdf looks like:
From loading up us-017.pdf in the pdfbox debugger, if I am not mistaken that comes from this chunk of pdf:
q
1.1040955 0 0 -1 91.74725 352 cm
0.153 0.145 0.145 RG
1 w
10 M
0 j
0 J
[ ] 0 d
/GS2 gs
0 254.924 m
209.606 254.924 l
S
Q
There is in fact only 1 line to segment in that pdf content (the l operator at the bottom). Of the 73 line segments extracted from that page, for every single one of them the coordinates pulled from pathIterator.currentSegment(_) when pathIterator.isDone() == true are that same point: Point2D.Float[323.17, 462.2]
Where does that coordinate come from? I think it's the second corner in the rectangles drawn (filled) as the background for the header row in the table on that page (page 2 of us-017.pdf).
So...it looks to me like that loop is reading coordinates that it shouldn't (conceptually: it's reading uninitialized memory), and thus also generating lines that don't actually exist in the PDF. However, the comment in the code is correct: fixing the loop apparently breaks the extraction for the us-017.pdf file. I think that means there is a bug elsewhere in the extraction machinery that the extra loop iteration is hiding.
[^1]: I have a pdf where it seems table extraction would be improved by tracking shaded regions, so I was looking to add support for filled rectangles.
ObjectExtractorStreamEngine.java
contains this codeI was looking at doing a little refactoring to that code[^1] and got curious about what was happening there. Changing the code to not read past
pathIterator.isDone()
does in fact cause the unit testTestSpreadsheetExtractor.testNaturalOrderOfRectanglesDoesNotBreakContract
(which readsus-017.pdf
) to fail. The failure suggests it is not reading the bottom right table cell.I added some
System.err.println
debugging at the top of thatwhile
loop to see what it is doing. The debugging for the first line segment inus-017.pdf
looks like:From loading up
us-017.pdf
in the pdfbox debugger, if I am not mistaken that comes from this chunk of pdf:There is in fact only 1 line to segment in that pdf content (the
l
operator at the bottom). Of the 73 line segments extracted from that page, for every single one of them the coordinates pulled frompathIterator.currentSegment(_)
whenpathIterator.isDone() == true
are that same point:Point2D.Float[323.17, 462.2]
Where does that coordinate come from? I think it's the second corner in the rectangles drawn (filled) as the background for the header row in the table on that page (page 2 of
us-017.pdf
).So...it looks to me like that loop is reading coordinates that it shouldn't (conceptually: it's reading uninitialized memory), and thus also generating lines that don't actually exist in the PDF. However, the comment in the code is correct: fixing the loop apparently breaks the extraction for the
us-017.pdf
file. I think that means there is a bug elsewhere in the extraction machinery that the extra loop iteration is hiding.[^1]: I have a pdf where it seems table extraction would be improved by tracking shaded regions, so I was looking to add support for filled rectangles.