sunmingtao / sample-code

3 stars 4 forks source link

doss.local.SeekableChannelInputStream.read() returned value out of range -1..255: -117 #178

Closed sunmingtao closed 3 years ago

sunmingtao commented 3 years ago

Cannot read from GZipStream

15:16:35.992 java[23934]: 2020-09-04 15:16:35,992: INFO : au.gov.nla.dl.services.AmberSolrIndexService - Indexing obj id: 25596857 and its children
15:16:35.992 java[23934]: 2020-09-04 15:16:35,992: DEBUG: au.gov.nla.dl.services.AmberSolrIndexService - - Indexing metadata...
15:16:36.007 java[23934]: 2020-09-04 15:16:36,007: INFO : au.gov.nla.dl.services.AmberSolrIndexService - Successfully indexed obj id: 25596857 metadata
15:16:36.007 java[23934]: 2020-09-04 15:16:36,007: DEBUG: au.gov.nla.dl.services.AmberSolrIndexService - - Index ocr data
15:16:36.077 java[23934]: 2020-09-04 15:16:36,077: INFO : au.gov.nla.dl.utils.WorkOcrHelper - Aborting book ocr retrieval. Book: 25597189 error: doss.local.SeekableChannelInputStream.read() returned value out of range -1..255: -117
15:16:36.077 java[23934]: 2020-09-04 15:16:36,077: ERROR: au.gov.nla.dl.services.AmberSolrIndexService - Failed to index OCR data for work 25596857, error: null 
sunmingtao commented 3 years ago

it seems the new SeekableChannelInputStream.read() method doesn't return a value in the expected range for gzip files, as demonstrated in the following program. Doss version 1.3.1 worked fine in Indexer while it didn't have this class yet. The indexer is now referencing Doss 1.6.4.

   public static void main(String[] args) throws IOException {
        Path path = Paths.get("/tmp/test.tgz"); //any gzip file
        try (SeekableByteChannel sbc = Files.newByteChannel(path, StandardOpenOption.READ);
             SeekableChannelInputStream steam = new SeekableChannelInputStream(sbc)){
            System.out.println(steam.read()); // return 31, OK
            System.out.println(steam.read()); // return -117, Bad
            // InputStream method contract stipulates the method must return in the range of -1...255
            //https://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#read()
        }
    }

I found this https://discuss.newrelic.com/t/countinginputstream-read-returned-value-out-of-range-1-255-53/53208/8, which might be related, in particular:

"Ok, so playing a bit more I think I actually found the root cause. I was wrong about being related to version number, this class didn’t change recently, although maybe the buffering did, but probably not. com.newrelic.agent.android.instrumentation.io.CountingInputStream buffering “garbles” the output that is read from the underlying InputStream (impl). The InputStream contract is int read() (in range [0, 255]), but the buffer inside CountingInputStream stores bytes (in range [-128, 127]). In CountingInputStream.readBuffer the buffer.get() call returns a byte which is then implicitly widened to int (readBuffer's return type). When converting bits of an int that is in [128, 255] range, that flips over to [-128, -1] because of the sign bit at the beginning, but when the implicit widening happens it doesn’t flip back."

The fix is in doss.local.SeekableChannelInputStream.read()

I am happy to create a PR if you can please give me access to the git repo

return n == 1 ? b[0] : -1; => return n == 1 ? b[0] & 0xff : -1;