samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

SamReader slower than expected on network drive #1660

Open BeatWolf opened 1 year ago

BeatWolf commented 1 year ago

Description of the issue:

Converting a SAM file to a BAM file was slower than expected. When looking at the profiler, I noticed the following:

[image: profiler screenshot]

SamReader spends about 20% of its time in the File.length() method. While this method is basically "free" on a local filesystem, it is not on a network drive.

This could easily be fixed by simply caching the size of the file in the constructor of htsjdk.samtools.seekablestream.SeekableFileStream.
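A minimal sketch of what that caching could look like. The class name `CachedLengthFileStream` and its methods are hypothetical stand-ins, not htsjdk's actual code, and the sketch assumes the file does not change size while it is open:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a file-backed stream; NOT the real SeekableFileStream.
class CachedLengthFileStream extends InputStream {
    private final FileInputStream in;
    // Queried once in the constructor; assumes the file size is stable while open.
    private final long length;
    private long position = 0;

    CachedLengthFileStream(final File file) throws IOException {
        this.in = new FileInputStream(file);
        this.length = file.length(); // the only file system size query
    }

    // Returns the cached size without touching the file system.
    public long length() {
        return length;
    }

    public long position() {
        return position;
    }

    // End-of-file check based on the cached size rather than File.length().
    public boolean eof() {
        return position >= length;
    }

    @Override
    public int read() throws IOException {
        final int b = in.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    @Override
    public int read(final byte[] buffer, final int offset, final int len) throws IOException {
        final int n = in.read(buffer, offset, len);
        if (n > 0) {
            position += n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The trade-off is that a file which grows after being opened would still report its original size, so this only fits inputs that are not being written to while they are read.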

Your environment:

Steps to reproduce

Put a SAM file on a network drive (in my case a Synology NAS with an SMB connection). Read the file and profile it.
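To isolate the cost of the size query from the rest of the read path, something like the following could be run against the same file on the network share and on a local disk. It is only a rough sketch: `FileLengthTimer` and its method are hypothetical names, and a simple loop like this is not a rigorous benchmark.

```java
import java.io.File;

// Hypothetical helper for a rough comparison; not part of htsjdk.
class FileLengthTimer {
    // Returns the average wall-clock nanoseconds per File.length() call.
    static double averageNanosPerLengthCall(final File file, final int calls) {
        long sink = 0;
        final long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += file.length(); // each call may query the file system
        }
        final long elapsed = System.nanoTime() - start;
        // Use the accumulated value so the loop is not optimized away.
        if (sink < 0) {
            throw new IllegalStateException("unexpected negative length sum");
        }
        return (double) elapsed / calls;
    }
}
```

Calling `FileLengthTimer.averageNanosPerLengthCall(new File(path), 10000)` once with a path on the SMB share and once with a local copy should show whether the per-call cost differs as much as the profile suggests.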

Expected behaviour

The code should not spend 20% of the time getting the length of the file.

Actual behaviour

The code asks the remote file system constantly how big the file is.

cmnbroad commented 1 year ago

Yeah, it looks like we should cache the length - some of SeekableFileStream's sibling classes already do this.