Closed keiranmraine closed 8 years ago
Hi Keiran,
could you please provide some more information on this, in particular:
Thanks German
Hi German,
This was still running today (8036 min).
Created with:
$ scramble -h
-=- sCRAMble -=- version 1.14.6
...
$ samtools view -H in.bam | head -n 1
@HD VN:1.5 SO:coordinate
$ scramble -I BAM -O CRAM -r mouse/37/genome.fa -t 4 in.bam out.cram
Back to BAM with scramble:
$ time scramble -r /nfs/cancer_ref01/mouse/37/genome.fa -I CRAM -O BAM cram/MD5146a.cram roundtrip.bam
real 166m21.278s
user 162m9.976s
sys 4m10.864s
The bamcollate2
job is still running, it's about 25% of the final BAM size after 41 minutes.
$ time bamcollate2 inputformat=cram collate=0 filename=out.cram O=collateTest
1 44546.7
...
[V] 456630493
[V] MemUsage(size=64.2461,rss=6.07812,peak=548.156) wall clock time 02:53:46:88648199
real 173m46.939s
user 167m44.133s
sys 6m2.911s
Same size as scramble CRAM fro roundtrip (27GB).
Hi Keiran,
could you please retry with version 2.0.35?
I am not sure this will make any difference. I have not been able to reproduce the issue myself so far.
German
Running now with updated version:
$ biobambam2-2.0.35-release-20160330111451-x86_64-etch-linux-gnu/bin/bamdownsamplerandom I=cram/MD5146a.cram O=ds.cram p=0.1 inputformat=cram outputformat=cram reference=ascat_ref/genome.fa >& ds_cram.log&
It continues to output: Unknown sort order field: unknown
to stderr despite the input being sorted:
$ samtools-1.2 view -H cram/MD5146a.cram | head -n 1
@HD VN:1.5 SO:coordinate
Will update with run-time tomorrow (assuming it completes).
Hi Keiran,
bamdownsamplerandom collates the input to make sure it keeps both mates of a pair. This means the output is not sorted, even if the input is. As the output is unsorted it may not be very suitable for CRAM output. It may be a good idea to output uncompressed BAM and run the result through bamsort, which then can turn it into CRAM.
Best, German
On 30.03.2016 14:30, Keiran Raine wrote:
Running now with updated version:
|$ biobambam2-2.0.35-release-20160330111451-x86_64-etch-linux-gnu/bin/bamdownsamplerandom I=cram/MD5146a.cram O=ds.cram p=0.1 inputformat=cram outputformat=cram reference=ascat_ref/genome.fa >& ds_cram.log& |
It continues to output: |Unknown sort order field: unknown| to stderr despite the input being sorted:
|$ samtools-1.2 view -H cram/MD5146a.cram | head -n 1 @HD VN:1.5 SO:coordinate |
Will update with run-time tomorrow (assuming it completes).
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/gt1/biobambam2/issues/17#issuecomment-203408041
Ah, that's good to know, is this behaviour the same for BAM? I'm assuming it's slightly different as the help suggests BAM output can generate an index file.
No, the behaviour for BAM is the same. Using the index option only really makes sense for single end reads in this case. Switching it on for paired end reads will most likely quit complaining that the file is unsorted.
I have added a hash option in the latest release which allows down sampling based on a hash function computed over the read names. This avoid the need for collation.
Hi,
I took a 29GB BAM file and converted it to CRAM using scramble with default parameters (resulting file 13GB).
I then ran bamdownsamplerandom (p=0.1) on both files. The BAM downsample completed in ~45 minutes. The CRAM version has now been running for 23h.
(biobambam2-2.0.33-release-20160317091357-x86_64-etch-linux-gnu)