samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

LittleEndianInputStream EOF test is incorrect #1614

Open cmnbroad opened 2 years ago

cmnbroad commented 2 years ago

As reported by @jrobinso: The test for EOF in LittleEndianInputStream.readString is wrong because in.read() is cast to a byte. The contract for read() returns an "int", and the signal for EOF is the return value is -1. However this is no longer a valid test after casting to a byte. The correct thing to do here, I think, is type "b" as an int, the cast to byte only when written to "bis".

    byte b;
    while ((b = (byte) in.read()) != 0) {
        if(b < 0) {
            throw new EOFException();
        }
        bis.write(b);
lbergelson commented 1 year ago

This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values.

Currently EOFException seems to be triggered correctly when the end of file is reached prematurely, but it's also triggered incorrectly on some non-ASCII characters which truncate to negative byte values.

It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly. How to fix it depends on what file encoding we're intending to support which means digging into what this thing actually is used for.

jrobinso commented 1 year ago

@lbergelson File encoding should be irrelevant, this is a raw byte stream. Regardless of other issues casting in.read() to a byte before that test is incorrect, I think. A positive int can become a negative byte.

jrobinso commented 1 year ago

InputStream.read() is supposed to return an integer between 0 and 255, or -1 if EOF, there is no encoding involved. So 1/2 of the values it can legally return will be negative when cast to a byte.

jrobinso commented 1 year ago

It's a weird convention, but its been like that since the beginning. I suppose the contract is for an "int" specifically so they can have a return value (-1) which indicates EOF. If the contract was for a byte there would be no available values to indicate EOF.

lbergelson commented 1 year ago

@jrobinso I agree that it's currently wrong and the fix you proposed is an improvement. What I meant to say is that we have to decide what the file encoding should be when we do the conversion to String because we will get incorrect/inconsistent answers if there's a mismatch and the file contains non-ascii values even once your fix goes in.

jrobinso commented 1 year ago

@lbergelson I still don't see what encoding has to do with this class, which is a stream reader for raw bytes. You can decode these bytes anyway you please, but that is downstream. In many cases the bytes will not be decoded to strings at all, but again what is done with the bytes is no concern here.

lbergelson commented 1 year ago

@jrobinso It's relevant because the method in question returns a string as it's result, so it has to make a decision about what the bytes mean.

jrobinso commented 1 year ago

@lbergelson OK, yes, sorry, I was focused on the test in question, the part below. It is most definitely a bug to cast the result from in.read() to a byte before testing for b<0. An EOF exception should not be thrown unless -1 is returned as an int. If there is a bug decoding the string later that would be a different problem. This method probably doesn't belong on LittleEndianInputStream, but since it's there maybe it should take the encoding as an argument, defaulting to utf-8. In practice this method is used to read genomic file formats, many which specify ascii text. If some other character encoding is used we might get a garbled string, or some other error, but we shouldn't see an EOFException.

    byte b;
    while ((b = (byte) in.read()) != 0) {
        if(b < 0) {
            throw new EOFException();
        }
        bis.write(b);
jrobinso commented 1 year ago

Here's the relevant text for GFF format, which I think is pretty typical among genomic file formats. The term "plain text" might be ambiguous, but I think is understand to mean ascii

" GFF3 files are 9-column, tab-delimited plain text files. Literal tabs, newline, carriage return, percent sign and control characters must be encoded using RFC 3986 Percent-Encoding "

The SAM specification is more specific

"Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1".

WRT "specified elsewhere", a number of fields allow UTF-8.