oracle / oci-hdfs-connector

HDFS Connector for Oracle Cloud Infrastructure
https://cloud.oracle.com/cloud-infrastructure

`ArrayIndexOutOfBoundsException` in read-ahead mode #69

Closed kvirund closed 2 years ago

kvirund commented 2 years ago

I am getting an error when using read-ahead mode:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 15639
    at com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream.read(BmcReadAheadFSInputStream.java:68)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.homesoft.spark.SparkClient.testBmcReadAhead(SparkClient.java:114)
    at com.homesoft.spark.SparkClient.main(SparkClient.java:102)

Here is a minimal reproducible test:

        try (final BmcFilesystem filesystem = BmcFilesystemTest.getOOSFilesystem("oci://bucket@tenancy",
                ReadMode.READ_AHEAD,
                (Integer) BmcProperties.READ_AHEAD_BLOCK_SIZE.getDefaultValue());
             final InputStream inputStream = filesystem.open(new Path(filename))) {
            logger.info("Reading file {} byte-by-byte", filename);
            while (-1 != inputStream.read()) ;
            logger.info("Successfully read file {}", filename);
        } catch (IOException | URISyntaxException e) {
            e.printStackTrace();
        }

where

public class BmcFilesystemTest {
...
    public static BmcFilesystem getOOSFilesystem(String storageUri, ReadMode readMode, int readAheadBlockSize) throws IOException, URISyntaxException {
        final BmcFilesystem filesystem = new BmcFilesystem();
        final Configuration configuration = getOOSConfiguration(readMode, readAheadBlockSize);

        filesystem.initialize(new URI(storageUri), configuration);
        return filesystem;
    }
...
    public static Configuration getOOSConfiguration(ReadMode readMode, int readAheadBlockSize) {
        final Configuration configuration = new Configuration();
        configuration.set("fs.oci.client.hostname", "https://objectstorage.us-ashburn-1.oraclecloud.com");
        configuration.set("fs.oci.client.auth.tenantId", "<tenancy OCID>");
        configuration.set("fs.oci.client.auth.userId", "<user OCID>");
        configuration.set("fs.oci.client.auth.fingerprint", "<fingerprint>");
        configuration.set("fs.oci.client.auth.pemfilepath", "<path to the private key>");
        configuration.set("fs.oci.client.auth.passphrase", "");
        configuration.set(BmcConstants.IN_MEMORY_READ_BUFFER_KEY, String.valueOf(ReadMode.IN_MEMORY == readMode));
        configuration.set(BmcConstants.READ_AHEAD_KEY, String.valueOf(ReadMode.READ_AHEAD == readMode));
        configuration.set(BmcConstants.READ_AHEAD_BLOCK_SIZE_KEY, String.valueOf(readAheadBlockSize));

        configuration.set("fs.oci.client.apache.connection.closing.strategy", "immediate"); // to avoid reading entire stream
        return configuration;
    }
}

I am not sure if there is some logic I am not aware of, but it looks like the bug is here:

    @Override
    public int read() throws IOException {
        LOG.debug("{}: Reading single byte at position {}", this, filePos);
        if (dataPos == -1) {
            fillBuffer();
        }
        if (dataPos == -1) {
            return -1;
        }
        filePos++;
        return Byte.toUnsignedInt(data[dataCurOffset++]);
    }

Whenever we read byte-by-byte, the data buffer is filled only once, when `dataPos == -1`. After that, nothing checks whether the buffer needs to be refilled, so `dataCurOffset` keeps incrementing until it runs past the end of `data` and throws `ArrayIndexOutOfBoundsException`.

I would suggest the following fix:

    @Override
    public int read() throws IOException {
        LOG.debug("{}: Reading single byte at position {}", this, filePos);
        if (dataPos == -1) {
            fillBuffer();
        }
        if (dataPos == -1) {
            return -1;
        }
        if (data.length == 1 + dataCurOffset) {
            dataPos = -1;
        }
        filePos++;
        return Byte.toUnsignedInt(data[dataCurOffset++]);
    }
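To sanity-check the idea, here is a self-contained toy analogue of the stream with the same invalidate-at-end guard applied. The class name `ReadAheadSketch` and its internals are hypothetical, not the connector's actual code; it also tracks the number of bytes actually buffered (`dataLen`) instead of comparing against `data.length`, since a generic backing stream may fill a block only partially.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Toy analogue (hypothetical, not the connector's code): a block-buffered
// stream that refills its buffer only when dataPos == -1, plus the
// invalidate-at-end guard so the next read() triggers a refill.
public class ReadAheadSketch extends InputStream {
    private final InputStream source;
    private final byte[] data;
    private int dataPos = -1;   // -1 means "buffer not filled / exhausted"
    private int dataLen;        // bytes actually buffered in the last fill
    private int dataCurOffset;  // next byte to hand out

    ReadAheadSketch(InputStream source, int blockSize) {
        this.source = source;
        this.data = new byte[blockSize];
    }

    private void fillBuffer() throws IOException {
        int n = source.read(data, 0, data.length);
        if (n > 0) {
            dataLen = n;
            dataPos = 0;
            dataCurOffset = 0;
        }
        // n <= 0: leave dataPos == -1 to signal EOF
    }

    @Override
    public int read() throws IOException {
        if (dataPos == -1) {
            fillBuffer();
        }
        if (dataPos == -1) {
            return -1;
        }
        // The guard: once the last buffered byte is handed out,
        // invalidate the buffer so the next call refills it instead
        // of indexing past the end of data.
        if (dataLen == 1 + dataCurOffset) {
            dataPos = -1;
        }
        return Byte.toUnsignedInt(data[dataCurOffset++]);
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[10_000];  // larger than one block
        ReadAheadSketch in =
                new ReadAheadSketch(new ByteArrayInputStream(payload), 4096);
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        System.out.println(count);  // all 10000 bytes read, no AIOOBE
    }
}
```

Without the guard, the same byte-by-byte loop indexes `data[dataLen]` on the read after the last buffered byte, which is exactly the reported failure mode.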

Here are the last lines with DEBUG output enabled on the BmcReadAheadFSInputStream:

[2022-04-01 20:41:57,224] com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream DEBUG - ReadAhead Stream for sample.xlsx: Reading single byte at position 15637
[2022-04-01 20:41:57,225] com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream DEBUG - ReadAhead Stream for sample.xlsx: Reading single byte at position 15638
[2022-04-01 20:41:57,226] com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream DEBUG - ReadAhead Stream for sample.xlsx: Reading single byte at position 15639
[2022-04-01 20:42:44,535] com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream DEBUG - ReadAhead Stream for sample.xlsx: Closing
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 15639
    at com.oracle.bmc.hdfs.store.BmcReadAheadFSInputStream.read(BmcReadAheadFSInputStream.java:68)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.homesoft.spark.SparkClient.testBmcReadAhead(SparkClient.java:114)
    at com.homesoft.spark.SparkClient.main(SparkClient.java:102)

I.e. it attempted to read past the end of the file: the failing index, 15639, is exactly the object's `content-length`, one past the last valid offset:

$ oci os object head -bn bucket -ns tenancy --name "sample.xlsx"
{
  "accept-ranges": "bytes",
  "access-control-allow-credentials": "true",
  "access-control-allow-methods": "POST,PUT,GET,HEAD,DELETE,OPTIONS",
  "access-control-allow-origin": "*",
  "access-control-expose-headers": "accept-ranges,access-control-allow-credentials,access-control-allow-methods,access-control-allow-origin,content-length,content-md5,content-type,date,etag,last-modified,opc-client-info,opc-client-request-id,opc-request-id,storage-tier,version-id,x-api-id",
  "content-length": "15639",
  "content-md5": "9kPGjn8vsb/iYQpAO92Hnw==",
  "content-type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "date": "Sat, 02 Apr 2022 01:46:22 GMT",
  "etag": "b25578a9-c839-4066-b932-4375c9027024",
  "last-modified": "Tue, 15 Mar 2022 21:47:13 GMT",
  "opc-client-request-id": "2204E36376B74F8A9E468294C6F4A7BD",
  "opc-request-id": "iad-1:a0P863M9NBeezUMrut68Ogv9OCJIY3MJiCvFHIMtyirQ98Rb_hpiIji7WTylLeqU",
  "storage-tier": "Standard",
  "version-id": "b934ca25-c585-4d5f-bc49-0bc85bc07199",
  "x-api-id": "native"
}
jodoglevy commented 2 years ago

Thanks for reporting - we'll take a look

mricken commented 2 years ago

Thank you for reporting this. I have reproduced and fixed the problem in our internal preview version, and the fix is out for code review. I'll let you know when we are releasing the fix publicly.

mricken commented 2 years ago

We're preparing to release this bugfix on April 26, 2022. Thank you for your patience.

mricken commented 2 years ago

Hi @kvirund , we have just released version 3.3.1.0.3.3 to GitHub, and the Maven Central release is in progress.

I'll close this issue, because I'm confident that the bug you reported has been fixed. If not, please feel free to re-open the issue. Thank you.