samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
282 stars 243 forks source link

BAM async compression could use multiple threads #559

Open akiezun opened 8 years ago

akiezun commented 8 years ago

Right now, only 1 thread per file is used. Compression easily has enough work for multiple threads and it would speed up the runtime

tfenne commented 8 years ago

I think the real challenge here is how to control the level of thread creation, and how to handle situations where you are writing many BAM files (e.g. think of Picard's IlluminaBasecallsToSam which does demultplexing to BAM).

In an ideal world there could be a pool of threads whose size is configurable/controllable, that is used for all BAM compression and writing within a VM (or factory instance etc.). E.g. I'd love to be able to specify -Dsamjdk.bam_write_threads=8, and have it use 8 threads whether I'm writing 1 or 96 BAM files.

akiezun commented 8 years ago

agreed. A fixed-size pool of threads for all our bam needs (#482 shows that reading also benefits from multiple threads though it would probably not need more than 1 (?))

vadimzalunin commented 8 years ago

Sounds like a singleton blocking queue of BAM blocks. I've added a similar functionality in cramtools recently. I think this should be generalized so that the same mechanism could be used for CRAM etc.

jkbonfield commented 4 years ago

Sorry to rather resurrect an ancient issue, but I see it's still open so perhaps this is better than creating a new one.

I am in the process of evaluating decode and encode speeds for various tools and so far it seems I get absolutely no gain with asynchronous I/O in htsjdk. Firstly, am I using it correct? I'm a java novice, but I have a basic test tool (below) and are using the Java -Dvar=value options to set the defaults. However it doesn't appear to validate these so whatever I type in is swallowed. How do I know if I'm setting the correct variables? Is there a verbose option somewhere?

My command is:

time java -Dhtsjdk.samtools.USE_ASYNC_IO_READ_FOR_SAMTOOLS=true -Dhtsjdk.samtools.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS=true  -cp picard.jar:. TestBamReadWrite.java ../novaseq-10m.bam  /tmp/out.bam

(Sorry, browser failure - it randomly posted my message mid writing. Will edit)

Edit: clearly not those -D options as setting COMPRESSION_LEVEL (as listed in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html) has no impact. How do I find out the namespace for these configuration variables please?

import htsjdk.samtools.DefaultSAMRecordFactory;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamInputResource;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;
import htsjdk.samtools.seekablestream.SeekableStream;

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

public class TestBamReadWrite {

public static void main(String[] args) throws IOException {
    SamReader reader = SamReaderFactory
    .makeDefault()
    .validationStringency(ValidationStringency.SILENT)
    .open(new File(args[0]));        // input

    final SAMFileWriter writer = new SAMFileWriterFactory()
    .makeWriter(reader.getFileHeader(),
            true,
            new File(args[1]),   // output
            new File(args[2]));  // reference

    for (SAMRecord record : reader) {
    writer.addAlignment(record);
    }

    reader.close();
    writer.close();
}

}

I realise I can add ".setUseAsyncIo(true)" in my reader and writer creation. Sadly it's pretty much negligible.

Can I confirm please that there is no true multi-threaded decompression or compression capability yet?

lbergelson commented 4 years ago

@jkbonfield It's super confusing... the namespace is samjdk and they're all lowercase. (ex:

-Dsamjdk.use_async_io_read_samtools=true

I don't know why, it's a historical artifact and one that really needs documentation somewhere.

You can assert that the values are getting set by checking the actual value of the variable Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS in your code (or looking at the results of Defaults.allDefaults()

There is currently a single additional core available when compressing / decompressing using the async options. I've been under the impression that it had a fairly significant improvement on write speed and a fairly minor one on read.

There's a long overdue to be merged pr #1249 which offers true multithreaded compression which should help.

jkbonfield commented 4 years ago

Thanks for the answer and for the heads up on 1249. Great to hear something is coming along.

Crispy13 commented 5 months ago

Any update?