Closed: THausherr closed this issue 7 years ago
Thank you for filing this issue. I'll schedule a look.
Here's also the PDF file for future regression tests: 584334-JBig2-p1.pdf
Thanks for providing the test resource!
The issue is related to text regions that use Huffman coding. I found two problems in class com.levigo.jbig2.segments.TextRegion. The first problem is in the creation of the symbol ID table. Replace the method symbolIDCodeLengths() with the following:
private void symbolIDCodeLengths() throws IOException {
  /* 1) - 2) */
  final List<Code> runCodeTable = new ArrayList<Code>();
  for (int i = 0; i < 35; i++) {
    final int prefLen = (int) (subInputStream.readBits(4) & 0xf);
    if (prefLen > 0) {
      runCodeTable.add(new Code(prefLen, 0, i, false));
    }
  }
  if (JBIG2ImageReader.DEBUG)
    log.debug(HuffmanTable.codeTableToString(runCodeTable));
  HuffmanTable ht = new FixedSizeTable(runCodeTable);
  /* 3) - 5) */
  long previousCodeLength = 0;
  int counter = 0;
  final List<Code> sbSymCodes = new ArrayList<Code>();
  while (counter < amountOfSymbols) {
    final long code = ht.decode(subInputStream);
    if (code < 32) {
      if (code > 0) {
        sbSymCodes.add(new Code((int) code, 0, counter, false));
      }
      previousCodeLength = code;
      counter++;
    } else {
      long runLength = 0;
      long currCodeLength = 0;
      if (code == 32) {
        runLength = 3 + subInputStream.readBits(2);
        if (counter > 0) {
          currCodeLength = previousCodeLength;
        }
      } else if (code == 33) {
        runLength = 3 + subInputStream.readBits(3);
      } else if (code == 34) {
        runLength = 11 + subInputStream.readBits(7);
      }
      for (int j = 0; j < runLength; j++) {
        if (currCodeLength > 0) {
          sbSymCodes.add(new Code((int) currCodeLength, 0, counter, false));
        }
        counter++;
      }
    }
  }
  /* 6) - Skip over remaining bits in the last Byte read */
  subInputStream.skipBits();
  /* 7) */
  symbolCodeTable = new FixedSizeTable(sbSymCodes);
}
As the standard says (ITU-T T.88, page 60), when the code is 33 or 34, the for loop should repeat the value 0, not previousCodeLength. Only when the code is 32 is previousCodeLength repeated.
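To make the run-code rule concrete, here is a minimal standalone sketch that does not use the library. The class name RunCodeExpansion, the method expand() and the {code, extraBits} input format are assumptions made for illustration: extraBits stands in for the value that subInputStream.readBits() would return. Unlike the method above, the sketch also clamps runs at amountOfSymbols.

```java
import java.util.Arrays;

public class RunCodeExpansion {

    /**
     * Expands already-decoded run codes into one code length per symbol.
     * Each entry of runCodes is {code, extraBits}. extraBits is only used
     * for codes 32..34. Codes are assumed to lie in the range 0..34.
     */
    public static int[] expand(int[][] runCodes, int amountOfSymbols) {
        int[] lengths = new int[amountOfSymbols];
        int counter = 0;
        int previous = 0;
        for (int[] rc : runCodes) {
            if (counter >= amountOfSymbols) {
                break;
            }
            int code = rc[0];
            int extra = rc[1];
            if (code < 32) {
                // A plain code length for the current symbol.
                lengths[counter++] = code;
                previous = code;
                continue;
            }
            int runLength;
            int value = 0; // codes 33 and 34 repeat the value 0
            if (code == 32) {
                // Only code 32 repeats the previous code length.
                runLength = 3 + extra;
                value = counter > 0 ? previous : 0;
            } else if (code == 33) {
                runLength = 3 + extra;
            } else {
                runLength = 11 + extra;
            }
            // Assumption of this sketch: never write past amountOfSymbols.
            for (int j = 0; j < runLength && counter < amountOfSymbols; j++) {
                lengths[counter++] = value;
            }
        }
        return lengths;
    }

    public static void main(String[] args) {
        int[][] runCodes = { {5, 0}, {32, 1}, {33, 0}, {2, 0} };
        // prints [5, 5, 5, 5, 5, 0, 0, 0, 2]
        System.out.println(Arrays.toString(expand(runCodes, 9)));
    }
}
```

In the sample input, code 32 with extra bits 1 repeats the previous length 5 four times, while code 33 with extra bits 0 produces three zero lengths, that is, three symbols without a code.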
Another problem is in the method getUserTable() in the same class TextRegion. Its search algorithm does not work. A method that implements the search properly can be found in getUserTable() in class SymbolDictionary. Copy that method to TextRegion, replacing the old one. Here is the copied method:
private HuffmanTable getUserTable(final int tablePosition) throws InvalidHeaderValueException, IOException {
  int tableCounter = 0;
  for (final SegmentHeader referredToSegmentHeader : segmentHeader.getRtSegments()) {
    if (referredToSegmentHeader.getSegmentType() == 53) {
      if (tableCounter == tablePosition) {
        final Table t = (Table) referredToSegmentHeader.getSegmentData();
        return new EncodedTable(t);
      } else {
        tableCounter++;
      }
    }
  }
  return null;
}
This latter fix for getUserTable() may not be necessary for the current issue 21, but it is needed to fix the earlier, still open issue 16 at Google Code. That old issue has attached PDF and JBIG2 files that use a user table in a text region.
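The lookup in getUserTable() can be pictured with a small standalone sketch, again without the library. The class UserTableLookup and the method findTableSegment() are hypothetical names. The input is just the list of segment types of the referred-to segments, where 53 is the table segment type used in the method above and 0 is a symbol dictionary.

```java
import java.util.Arrays;
import java.util.List;

public class UserTableLookup {

    static final int TABLE_SEGMENT_TYPE = 53;

    /**
     * Returns the index (within segmentTypes) of the table segment at
     * position tablePosition, counting only table segments, or -1 if
     * there are not enough table segments.
     */
    public static int findTableSegment(List<Integer> segmentTypes, int tablePosition) {
        int tableCounter = 0;
        for (int i = 0; i < segmentTypes.size(); i++) {
            if (segmentTypes.get(i) == TABLE_SEGMENT_TYPE) {
                if (tableCounter == tablePosition) {
                    return i;
                }
                // Only table segments advance the counter.
                tableCounter++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> referred = Arrays.asList(0, 53, 0, 53);
        System.out.println(findTableSegment(referred, 0)); // prints 1
        System.out.println(findTableSegment(referred, 1)); // prints 3
        System.out.println(findTableSegment(referred, 2)); // prints -1
    }
}
```

The point of the sketch is that tablePosition counts table segments only, so other referred-to segments in between do not shift the position.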
Thank you. Awesome! I'll compose a pull-request with your suggested fix.
See #22
@THausherr I noticed that the decoded image produced by your JBIG2 file is slightly incomplete. Some text is missing at the bottom of the page: "Hard Copy Not Controlled..." (in a rectangular frame) and "Contract No ...". If I open your PDF in Acrobat Reader, those texts are shown. My pull request #29 should fix this problem and make the texts visible.
jbig2bug.zip
my code:
I'm using your library as part of the Apache PDFBox project. The two data segments come from the PDF, which displays in Adobe Reader, so I'd assume that the image is valid.