samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Question: Getting SAMFileHeader from partial files or byte chunks. #1604

Closed gariem closed 11 months ago

gariem commented 2 years ago

Description of the issue:

This is a request or question rather than a bug report: is it possible to create a file header from an arbitrary chunk of bytes (e.g. a partial file being downloaded from another location)? Any comment or suggestion is welcome.

Your environment:

Steps to reproduce

When we have large files and only need the sequenceDictionary from the SAMFileHeader, is there a way to get it without having the entire file available?

Example:

Suppose we have a 200 GB file in a remote location and need to access the file header for some processing. Can I transfer only a range of bytes from my remote location and then use those bytes to get a SAMFileHeader? Using samtools to split the data on the remote server is not possible, since that storage can't execute commands; it can only retrieve entire files or byte ranges of them.

Expected behaviour

File header is retrieved by providing a partial file.

Actual behaviour

The file header is retrieved after loading the entire file.

lindenb commented 2 years ago

Can I transfer only a range of bytes from my remote location and then use those bytes to get a SAMFileHeader?

Yes: just open a URL and open a SamReader on it using https://www.javadoc.io/doc/com.github.samtools/htsjdk/1.132/htsjdk/samtools/SamReaderFactory.html#open(htsjdk.samtools.SamInputResource)

// srf is a SamReaderFactory; is is a SamInputResource wrapping the URL
try (SamReader sr = srf.open(is)) {
    SAMFileHeader h = sr.getFileHeader();
}

gariem commented 2 years ago

Thanks @lindenb
I forgot to mention one limitation: the remote file is encrypted, so the most efficient approach I have is to download a chunk of bytes and decrypt it before trying to read/build the header object (downloading everything is not ideal for large files, as I said). Otherwise, the recommendation would have been perfect.
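
For readers following along, the "download a chunk of bytes" step can be sketched with the JDK's own HTTP client. This is only an illustration, not part of htsjdk. It assumes the storage is reachable over HTTP(S) and honours the Range header, which the thread does not confirm. The class name RangeFetchSketch and the chunk size chosen by the caller are hypothetical, and decryption is left out entirely.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not an htsjdk class.
final class RangeFetchSketch {

    /** Builds a GET request for the inclusive byte range [first, last]. */
    static HttpRequest rangeRequest(URI uri, long first, long last) {
        if (first < 0 || last < first) {
            throw new IllegalArgumentException("invalid range " + first + "-" + last);
        }
        return HttpRequest.newBuilder(uri)
                .header("Range", "bytes=" + first + "-" + last)
                .GET()
                .build();
    }

    /** Downloads the range; assumes the server answers 206 Partial Content. */
    static byte[] fetch(HttpClient client, URI uri, long first, long last)
            throws IOException, InterruptedException {
        HttpResponse<byte[]> response =
                client.send(rangeRequest(uri, first, last),
                            HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 206) {
            // A 200 here would mean the server ignored the Range header.
            throw new IOException("Range not honoured, status " + response.statusCode());
        }
        return response.body();
    }
}
```

The returned bytes would still need to be decrypted before any reader can interpret them. If the chunk turns out to be too small to hold the whole header, a larger range can be requested.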

lindenb commented 2 years ago

@gariem well, you'll only download the first bytes with the method above, not the whole BAM.
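
The first bytes are enough because, in the BAM format, the header and reference list sit at the very start of the compressed stream. As a rough illustration only, the sketch below reads the reference names and lengths from an already decrypted leading chunk using just the JDK. It does not use htsjdk and does not build a SAMFileHeader. The class name BamHeaderSketch is hypothetical, and the code assumes the input is BAM (not SAM or CRAM) and relies on BGZF blocks being readable as concatenated gzip members.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not an htsjdk class.
final class BamHeaderSketch {

    /** Returns reference name -> length, in file order, from the start of a BAM stream. */
    static Map<String, Integer> readReferences(InputStream decryptedChunk) throws IOException {
        DataInputStream in = new DataInputStream(new GZIPInputStream(decryptedChunk));

        byte[] magic = readBytes(in, 4);
        if (magic[0] != 'B' || magic[1] != 'A' || magic[2] != 'M' || magic[3] != 1) {
            throw new IOException("Not a BAM stream");
        }

        // l_text, then the plain-text SAM header, which is skipped here.
        readBytes(in, readIntLE(in));

        int nRef = readIntLE(in);
        Map<String, Integer> references = new LinkedHashMap<>();
        for (int i = 0; i < nRef; i++) {
            byte[] name = readBytes(in, readIntLE(in)); // l_name counts the trailing NUL
            int length = readIntLE(in);
            int nameLength = Math.max(0, name.length - 1);
            references.put(new String(name, 0, nameLength, StandardCharsets.UTF_8), length);
        }
        return references;
    }

    // BAM integers are little-endian; DataInputStream reads big-endian.
    private static int readIntLE(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }

    private static byte[] readBytes(DataInputStream in, int n) throws IOException {
        byte[] bytes = new byte[n];
        in.readFully(bytes); // throws EOFException if the chunk ends too early
        return bytes;
    }
}
```

If the chunk is cut off before the reference list ends, the read fails with an EOFException, which signals that a larger leading range is needed.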