gt1 / biobambam2

Tools for early stage alignment file processing
Other
93 stars 17 forks source link

bamdownsamplerandom - bam vs cram #17

Closed keiranmraine closed 8 years ago

keiranmraine commented 8 years ago

Hi,

I took a 29GB BAM file and converted it to CRAM using scramble with default parameters (resulting file 13GB).

I then ran bamdownsamplerandom (p=0.1) on both files. The BAM downsample completed in ~45 minutes. The CRAM version has now been running for 23h.

(biobambam2-2.0.33-release-20160317091357-x86_64-etch-linux-gnu)

gt1 commented 8 years ago

Hi Keiran,

could you please provide some more information on this, in particular:

Thanks German

keiranmraine commented 8 years ago

Hi German,

This was still running today (8036 min).

Created with:

$ scramble -h
  -=- sCRAMble -=-     version 1.14.6
...
$ samtools view -H in.bam | head -n 1
@HD VN:1.5  SO:coordinate
$ scramble -I BAM -O CRAM -r mouse/37/genome.fa -t 4 in.bam out.cram

Back to BAM with scramble:

$ time scramble -r /nfs/cancer_ref01/mouse/37/genome.fa -I CRAM -O BAM cram/MD5146a.cram roundtrip.bam
real    166m21.278s
user    162m9.976s
sys 4m10.864s

The bamcollate2 job is still running, it's about 25% of the final BAM size after 41 minutes.

keiranmraine commented 8 years ago
$ time bamcollate2 inputformat=cram collate=0 filename=out.cram O=collateTest
1       44546.7
...
[V] 456630493
[V] MemUsage(size=64.2461,rss=6.07812,peak=548.156) wall clock time 02:53:46:88648199

real    173m46.939s
user    167m44.133s
sys 6m2.911s

Same size as scramble CRAM fro roundtrip (27GB).

gt1 commented 8 years ago

Hi Keiran,

could you please retry with version 2.0.35?

(see https://github.com/gt1/biobambam2/releases/download/2.0.35-release-20160330111451/biobambam2-2.0.35-release-20160330111451-x86_64-etch-linux-gnu.tar.gz )

I am not sure this will make any difference. I have not been able to reproduce the issue myself so far.

German

keiranmraine commented 8 years ago

Running now with updated version:

$ biobambam2-2.0.35-release-20160330111451-x86_64-etch-linux-gnu/bin/bamdownsamplerandom I=cram/MD5146a.cram O=ds.cram p=0.1 inputformat=cram outputformat=cram reference=ascat_ref/genome.fa >& ds_cram.log&

It continues to output: Unknown sort order field: unknown to stderr despite the input being sorted:

$ samtools-1.2 view -H cram/MD5146a.cram | head -n 1
@HD VN:1.5  SO:coordinate

Will update with run-time tomorrow (assuming it completes).

gt1 commented 8 years ago

Hi Keiran,

bamdownsamplerandom collates the input to make sure it keeps both mates of a pair. This means the output is not sorted, even if the input is. As the output is unsorted it may not be very suitable for CRAM output. It may be a good idea to output uncompressed BAM and run the result through bamsort, which then can turn it into CRAM.

Best, German

On 30.03.2016 14:30, Keiran Raine wrote:

Running now with updated version:

|$ biobambam2-2.0.35-release-20160330111451-x86_64-etch-linux-gnu/bin/bamdownsamplerandom I=cram/MD5146a.cram O=ds.cram p=0.1 inputformat=cram outputformat=cram reference=ascat_ref/genome.fa >& ds_cram.log& |

It continues to output: |Unknown sort order field: unknown| to stderr despite the input being sorted:

|$ samtools-1.2 view -H cram/MD5146a.cram | head -n 1 @HD VN:1.5 SO:coordinate |

Will update with run-time tomorrow (assuming it completes).

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/gt1/biobambam2/issues/17#issuecomment-203408041

keiranmraine commented 8 years ago

Ah, that's good to know, is this behaviour the same for BAM? I'm assuming it's slightly different as the help suggests BAM output can generate an index file.

gt1 commented 8 years ago

No, the behaviour for BAM is the same. Using the index option only really makes sense for single end reads in this case. Switching it on for paired end reads will most likely quit complaining that the file is unsorted.

gt1 commented 8 years ago

I have added a hash option in the latest release which allows down sampling based on a hash function computed over the read names. This avoid the need for collation.