We have developed the S3ReaderFactory plugin, an HTSJDK CustomReaderFactory implementation. It enables convenient and fast direct reading of SAM/BAM files stored in AWS S3 (both private and public buckets).
The key factor in the speed increase is the simultaneous download of file parts (chunks) over multiple connections. In our tests, Picard ViewSam ran ten times faster or more than with the original reader (the effect depends on the number of threads, the number of connections, connection speed, etc.).
The plugin does not add any new dependencies to HTSJDK; it adds new functionality at runtime. The README is attached.
We would like to offer this code for your review and acceptance, and to ask what the best path forward would be.
----------< README.MD >---------
Amazon S3 plugin for HTSJDK
Overview
This is a plugin for HTSJDK that enables multi-connection loading of
SAM/BAM files stored in AWS S3. The plugin provides an implementation of a
custom reader that can be plugged into HTSJDK-based tools. The plugin does not
add any new dependencies to HTSJDK; it is loaded, and provides its new
functionality, at runtime. A guide for benchmarking with a public S3 BAM file
can be found at performance_test/public/README.md.
The plugin requires Java 1.8 and HTSJDK v2.1.1 or newer.
Plugin version: 1.0.
Build and Usage
To build this package, use the following command:
./gradlew shadowJar
This command produces one file: s3HtsjdkReaderFactory.jar.
To use this plugin with HTSJDK, pass the custom reader factory to the JVM via
the samjdk.custom_reader system property, built from the following parts:
CLASS_NAME=com.epam.cmbi.s3.S3ReaderFactory
PLUGIN_PATH=[path to s3HtsjdkReaderFactory.jar]
BUCKET=[S3 bucket]
KEY=[S3 key]
PREFIX=[prefix of your S3 resource URLs; for example, for https://s3.amazonaws.com/3kricegenome/9311/IRIS_313-15896.realigned.bam the prefix could be 'https://s3.amazonaws.com/' or 'https://s3', etc.]
Example: Working with Picard tools
The plugin is suitable for working with Picard tools (tested with
Picard-tools v2.0.1); these need to be downloaded and built separately (see
instructions here).
It should be possible to run the Picard tools in the following fashion:
CLASS_NAME=com.epam.cmbi.s3.S3ReaderFactory
PLUGIN_PATH=[path to s3HtsjdkReaderFactory.jar]
PICARD_JAR=[path to picard-tools.jar]
METRIC_COMMAND=[metrics with parameters]
BUCKET=[S3 bucket]
KEY=[S3 key]
PREFIX=https://s3.amazonaws.com/
java -Dsamjdk.custom_reader=$PREFIX,$CLASS_NAME,$PLUGIN_PATH -jar $PICARD_JAR $METRIC_COMMAND
For example:
java -Dsamjdk.custom_reader=https://s3.amazonaws.com,com.epam.cmbi.s3.S3ReaderFactory,/path/to/plugin/S3ReaderFactory.jar -jar /path/to/picard/picard.jar ViewSam INPUT=https://s3.amazonaws.com/3kricegenome/9311/IRIS_313-15896.realigned.bam VERBOSITY=INFO VALIDATION_STRINGENCY=SILENT
Interval list
Some Picard tools (e.g. ViewSam) take a list of intervals from
an input file. When an index file is available for the BAM file and an interval
list file is provided, the plugin downloads only those intervals.
AWS Authentication
In order to use the plugin, AWS credentials need to be set up. Credentials
must be set in at least one of the following locations in order to be used:
* A credentials file at the following path: [USER_HOME]/.aws/. This file
should contain AWS_ACCESS_KEY_ID and AWS_SECRET_KEY, in that order.
* Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_KEY.
Configuration parameters
The plugin has the following configuration parameters (set using JVM options):
* Number of connections to S3
  * JVM option `samjdk.s3plugin.number_of_connections`
  * Default value: 50
* Min download chunk size
  * JVM option `samjdk.s3plugin.min_download_chunk_size`
  * Default value: 32768 bytes = 32 kilobytes
* Max download chunk size
  * JVM option `samjdk.s3plugin.max_download_chunk_size`
  * Default value: 8388608 bytes = 8 megabytes
* Number of connection retries
  * JVM option `samjdk.s3plugin.custom_retry_count`
  * Default value: 3
* Index file URL
  * JVM option `samjdk.s3plugin.index_file_url`
  * Default: try to find an index file using the
name of the BAM file (`.bai` or `.bam.bai` extension)

The download process starts with a chunk size equal to
min_download_chunk_size; the chunk size then increases up to
max_download_chunk_size.
These options can be set using the -D$OPTION=$VALUE syntax.
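Inside the JVM these options are ordinary system properties. A minimal sketch of how they might be resolved together with their documented defaults (the class and method names below are illustrative, not the plugin's actual API):

```java
// Illustrative sketch: looking up the plugin's JVM options with their
// documented defaults. Only the property keys come from the README;
// the class and method names are hypothetical.
public class S3PluginOptions {
    public static int numberOfConnections() {
        // -Dsamjdk.s3plugin.number_of_connections=N, default 50
        return Integer.getInteger("samjdk.s3plugin.number_of_connections", 50);
    }

    public static int minChunkSize() {
        // default 32768 bytes = 32 KB
        return Integer.getInteger("samjdk.s3plugin.min_download_chunk_size", 32768);
    }

    public static int maxChunkSize() {
        // default 8388608 bytes = 8 MB
        return Integer.getInteger("samjdk.s3plugin.max_download_chunk_size", 8388608);
    }

    public static int retryCount() {
        // default 3
        return Integer.getInteger("samjdk.s3plugin.custom_retry_count", 3);
    }

    public static void main(String[] args) {
        System.out.println(numberOfConnections() + " connections, chunks "
                + minChunkSize() + ".." + maxChunkSize() + " bytes, "
                + retryCount() + " retries");
    }
}
```

Unset properties fall back to the defaults listed above, so running a tool without any -D flags uses 50 connections and 32 KB..8 MB chunks.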
Memory Usage
The theoretical upper memory requirement is calculated using the following formula:
theoretical upper memory requirement = max chunk size * number of connections * 3 (capacity buffer coefficient).
With the default values this equals 8 MB * 50 * 3 = 1200 MB.
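As a sanity check, the default bound works out as follows (a small illustrative calculation, not plugin code):

```java
// Illustrative calculation of the documented upper memory bound:
// max chunk size * number of connections * 3 (capacity buffer coefficient).
public class MemoryBound {
    static long upperBoundBytes(long maxChunkBytes, int connections) {
        return maxChunkBytes * connections * 3;
    }

    public static void main(String[] args) {
        // defaults: 8 MB max chunk size, 50 connections
        long bytes = upperBoundBytes(8L * 1024 * 1024, 50);
        System.out.println(bytes / (1024 * 1024) + " MB"); // 8 * 50 * 3 = 1200 MB
    }
}
```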
Performance Monitoring
The plugin continuously reports the amount of downloaded data, the number of GET
requests made to AWS S3, and the elapsed time. This information is written to the
log every 5 seconds.
Index files
Index files act as an external table of contents and allow the program to
jump directly to specific parts of the BAM file without reading all of the
sequence data.
The plugin has the following two options for the index file location:
* The index file location is specified by the user in the configuration parameters;
* The index file location is guessed from the BAM file name (if it is not
specified in the configuration parameters).
In the latter case, the index file is assumed to have the same name as the BAM
file. The plugin first looks for the .bam.bai file and, if it does not exist,
then searches for the .bai file. If the index file location was provided
using the JVM option but its URL is wrong, an IllegalArgumentException
is thrown.
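The fallback naming scheme can be sketched as a simple string transformation (the helper below is hypothetical and only mirrors the lookup order described above: first `.bam.bai`, then `.bai`):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the documented index-lookup order: for a BAM at
// <url>.bam, first try <url>.bam.bai, then <url>.bai (same base name).
public class IndexNameGuess {
    static List<String> candidateIndexUrls(String bamUrl) {
        String base = bamUrl.endsWith(".bam")
                ? bamUrl.substring(0, bamUrl.length() - 4)
                : bamUrl;
        // Order matters: .bam.bai is tried before .bai.
        return Arrays.asList(bamUrl + ".bai", base + ".bai");
    }

    public static void main(String[] args) {
        System.out.println(candidateIndexUrls(
                "https://s3.amazonaws.com/bucket/sample.bam"));
    }
}
```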
The index file is downloaded over a single connection, using a single GET
request.
BAM files are downloaded using multiple threads, which retrieve the data in
chunks of configurable size: the target BAM file is partitioned into chunks
while being downloaded.
The chunk size used by each downloading thread grows exponentially from
samjdk.s3plugin.min_download_chunk_size to
samjdk.s3plugin.max_download_chunk_size (both measured in bytes).
The number of connections that the plugin creates is
samjdk.s3plugin.number_of_connections. The total number of GET requests for a
file equals the total number of chunks.
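The growth of the per-thread chunk size can be sketched as follows. Note this is an assumption-laden illustration: the README says only that the size grows exponentially between the two bounds, so the doubling factor here is a guess, not the plugin's actual schedule.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative chunk-size schedule, growing exponentially (here: doubling,
// an assumed factor) from min_download_chunk_size up to the
// max_download_chunk_size cap.
public class ChunkSchedule {
    static List<Integer> firstChunkSizes(int min, int max, int count) {
        List<Integer> sizes = new ArrayList<>();
        int size = min;
        for (int i = 0; i < count; i++) {
            sizes.add(size);
            size = Math.min(size * 2, max); // grow, capped at the maximum
        }
        return sizes;
    }

    public static void main(String[] args) {
        // defaults: 32 KB minimum, 8 MB maximum
        System.out.println(firstChunkSizes(32 * 1024, 8 * 1024 * 1024, 10));
    }
}
```

With the defaults, a thread would ramp from 32 KB to the 8 MB cap within the first handful of chunks, so most of a large file is fetched in max-size chunks.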
Reconnection
The plugin can reconnect to the server if the connection is lost while
downloading. The number of reconnection attempts can be configured using the
samjdk.s3plugin.custom_retry_count parameter. On each reconnection attempt,
the S3 client itself makes 10 attempts, one every 10 seconds.
Due to the chunk downloading algorithm, heavy reconnecting may cause some
unused data to be downloaded. In our tests with a 325 GB BAM file, about 1 GB
of extra data was downloaded, an overhead of about 0.3%.
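The outer retry loop can be sketched as below. This is a hypothetical illustration of the behavior described above, not the plugin's code; the inner "10 attempts every 10 seconds" belong to the AWS S3 client itself and are not modeled here.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of the documented retry behavior: a chunk download is
// retried up to custom_retry_count times (default 3) before giving up.
public class RetryingDownload {
    static <T> T withRetries(Callable<T> attempt, int retryCount) throws Exception {
        Exception last = null;
        for (int i = 0; i <= retryCount; i++) {
            try {
                return attempt.call();
            } catch (IOException e) {
                last = e; // connection lost: try again
            }
        }
        throw last; // all retries exhausted
    }

    public static void main(String[] args) throws Exception {
        int[] failures = {2}; // simulate two lost connections, then success
        String result = withRetries(() -> {
            if (failures[0]-- > 0) throw new IOException("connection lost");
            return "chunk downloaded";
        }, 3);
        System.out.println(result);
    }
}
```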
Downloading files from AWS S3
The plugin uses the AWS Java SDK for downloading files from Amazon S3. The
AmazonS3.getObject(GetObjectRequest) method is used to retrieve an S3Object,
which provides an S3ObjectInputStream; getObject uses the Object GET operation
of the S3 REST API.