samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Expose option for memory mapping with private MEMORY #1470

Open lbergelson opened 4 years ago

lbergelson commented 4 years ago

Users who are using FUSE to access remote file systems are having trouble with the bam index memory mapping. I've been told that exposing the ability to specify the PRIVATE option when opening a memory mapped file will fix the problem:

I have been working on a filesystem that will allow mounting files remotely with WDL for dnanexus (https://github.com/dnanexus/dxfuse). It is a File System in User Space, which is very common for cloud object systems these days. We just ran into an interesting problem with GATK. It turns out that it mmaps some files, which doesn't work in the general case on FUSE filesystems. Only private mappings are supported. Since all major cloud providers (Amazon, Google, ...) provide such file systems, it would be good if GATK used the "private" flag in its java code.

Here is a good explanation for this: https://github.com/jacobsa/fuse/issues/82

Here is the location in the Linux kernel where you can see FUSE doesn't do shared mappings: https://elixir.bootlin.com/linux/v5.6/source/fs/fuse/file.c#L2306 This is the samtools java code, which I think belongs to the Broad: https://github.com/samtools/htsjdk/blob/16a4e376fbbd2d2d7f85841b8bfcc0e6509f97a8/src/main/java/htsjdk/samtools/MemoryMappedFileBuffer.java

If you add the PRIVATE flag https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.MapMode.html#PRIVATE you will be able to use FUSE filesystems. This may help you with the streaming library that you have.

This seems pretty easy to do in htsjdk. Wiring it all the the way to users might be a pain in some places though.

lbergelson commented 4 years ago

@orodeh I've opened this issue to track the issue you brought up. We will need to make a change to htsjdk first and then propagate it to picard/GATK. It might be tricky to expose it universally for picard tools since they don't all share a standard mechanism for opening bam files. It shouldn't be hard to wire itin GATK though.

tfenne commented 4 years ago

@lbergelson Maybe this is a case where a system property / Defaults value makes more sense?

lbergelson commented 4 years ago

Yeah, it might be. I always prefer to configure in code rather than using the system properties though, and since the other memory mapping options are options to the reader rather than system properties I thought it would make sense there.

tfenne commented 4 years ago

both

In all seriousness, I think the best way is to have a Default that configures the default behavior of the SamReadFactory, but then also have API access to override that default. So for folks who are writing code it's easy to configure via API, and others who are e.g. just trying to run picard/gatk/whatever still have a way to control the behavior.

lbergelson commented 4 years ago

I think you're right.

orodeh commented 4 years ago

If you don't actually need the mapping to be shared between several processes, then, it would be nice to make it PRIVATE by default.

lbergelson commented 4 years ago

We almost certainly don't need it shared between several processes.