Open jkbonfield opened 5 years ago
if it isn't in the spec it should be!
On Fri, Feb 1, 2019 at 11:31 AM James Bonfield notifications@github.com wrote:
Verify
When picard (which I assume means htsjdk) fails, it does so silently and exits 0. Subject of the issue
If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.
I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order. Steps to reproduce
Example to reproduce whole file streaming differences (BAM vs CRAM):
Create 10 bam and cram files with snippet from first 10 chromosomes
$ for i in
seq 1 10
;do samtools view -@8 -O cram -o /tmp/_$i.cram ~/scratch/data/98272#49.cram $i:20000000-25000000;done $ for i inseq 1 10
;do samtools view -@8 -O bam -o /tmp/$i.bam ~/scratch/data/9827_2#49.cram $i:20000000-25000000;doneConcatenate in random order
$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam $ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.cram -o /tmp/.cram
Count entries
$ samtools view -c /tmp/.bam 935949 $ samtools view -c /tmp/.cram 935949
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCESEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 935949 $ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 98188
Example to reproduce htsjdk vs samtools differences in random access:
Create input
$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam; samtools index /tmp/.bam $ samtools cat /tmp/{1,2,3,4,5,6,7,8,9,10}.bam -o /tmp/_o.bam; samtools index /tmp/_o.bam
Create region list
$ (samtools view -H /tmp/.bam;for i in 3 8 2 4;do echo -e "$i\t22000000\t24000000\t+\ttarget$i";done) > /tmp/_reg.list
Query header-sorted: OK
$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025 $ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025
Query out-of-order references: silent failure
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551
With samtools
$ for i in .cram .bam _o.cram _o.bam;do samtools view -@4 -c /tmp/$i $(for i in 3 8 2 4;do echo $i:22000000-24000000;done);done 149025 149025 149025 149025
Validation against original data file (Picard also matches this, for both bam and cram)
$ samtools view -@4 -c ~/scratch/data/9827_2#49.bam $(for i in 3 8 2 4;do echo $i:22000000-24000000;done) 149025
I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one? Your environment
openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/1271, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0k7MWRSJ9fSYSzFDhiD1c9wziQY_ks5vJGvWgaJpZM4aepoL .
For coordinate sort, the major sort key is the RNAME field, with order defined by the order of @SQ lines in the header.
On Fri, Feb 1, 2019 at 11:34 AM Yossi Farjoun farjoun@broadinstitute.org wrote:
if it isn't in the spec it should be!
On Fri, Feb 1, 2019 at 11:31 AM James Bonfield notifications@github.com wrote:
Verify
When picard (which I assume means htsjdk) fails, it does so silently and exits 0. Subject of the issue
If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.
I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order. Steps to reproduce
Example to reproduce whole file streaming differences (BAM vs CRAM):
Create 10 bam and cram files with snippet from first 10 chromosomes
$ for i in
seq 1 10
;do samtools view -@8 -O cram -o /tmp/_$i.cram ~/scratch/data/98272#49.cram $i:20000000-25000000;done $ for i inseq 1 10
;do samtools view -@8 -O bam -o /tmp/$i.bam ~/scratch/data/9827_2#49.cram $i:20000000-25000000;doneConcatenate in random order
$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam $ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.cram -o /tmp/.cram
Count entries
$ samtools view -c /tmp/.bam 935949 $ samtools view -c /tmp/.cram 935949
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCESEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 935949 $ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 98188
Example to reproduce htsjdk vs samtools differences in random access:
Create input
$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam; samtools index /tmp/.bam $ samtools cat /tmp/{1,2,3,4,5,6,7,8,9,10}.bam -o /tmp/_o.bam; samtools index /tmp/_o.bam
Create region list
$ (samtools view -H /tmp/.bam;for i in 3 8 2 4;do echo -e "$i\t22000000\t24000000\t+\ttarget$i";done) > /tmp/_reg.list
Query header-sorted: OK
$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025 $ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025
Query out-of-order references: silent failure
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551
With samtools
$ for i in .cram .bam _o.cram _o.bam;do samtools view -@4 -c /tmp/$i $(for i in 3 8 2 4;do echo $i:22000000-24000000;done);done 149025 149025 149025 149025
Validation against original data file (Picard also matches this, for both bam and cram)
$ samtools view -@4 -c ~/scratch/data/9827_2#49.bam $(for i in 3 8 2 4;do echo $i:22000000-24000000;done) 149025
I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one? Your environment
openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/1271, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0k7MWRSJ9fSYSzFDhiD1c9wziQY_ks5vJGvWgaJpZM4aepoL .
Maybe, but splitting into chromosomes files, operating in parallel and then merging by fast concatenation is not an uncommon strategy. Care must be taken to concatenate in the correct order, but I could imagine a naive "samtools cat my_file*.bam" and not noticing the lack of ordering if tools just carry on working regardless.
Checking the spec I see it now does clarify this:
For coordinate sort, the major sort key is the {\sf RNAME} field, with order defined by the order of {\tt @SQ} lines in the header. The minor sort key is the {\sf POS} field.
In that case please consider this issue to be a bug report for silently failing when it should be noisy.
in that case we should have the discussion in picard, not htsjdk...
but to the point, ViewSam doesn't validate it....Try running your file through ValidateSamFile and I suspect you'll get the error you are looking for.
I forgot to add the link to the other issue where I was prompted to file this:
https://github.com/samtools/hts-specs/issues/376
I still don't know if this is picard or htsjdk as I don't know what check is where. Giving different behaviour on BAM vs CRAM does sound like it's more likely to be an htsjdk check.
Of course maybe this is an issue for samtools too! If this is illegal (which it appears to be) then samtools cat
could spot it upfront and fail early. However for whatever reason, someone went to long extents to ensure samtools and htslib works with out of order references. I haven't researched why and whom yet.
well...the problem is that sometimes there's an out of order bam/sam/cram that needs to be fixed...so you don't want htsjdk/htslib to fail outright since then it will be difficult to write a tool that will fix it....
That's a good point. I think we should probably be issuing warnings but not make them fatal so people can fix data (eg sort runs so it'll get back to a known good state).
Added CRAM label due to the first issue above ("whole file streaming differences (BAM vs CRAM)":
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
935949
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
98188
Verify
When picard (which I assume means htsjdk) fails, it does so silently and exits 0.
Subject of the issue
If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.
I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order.
Steps to reproduce
Example to reproduce whole file streaming differences (BAM vs CRAM):
Example to reproduce htsjdk vs samtools differences in random access:
I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one?
Your environment
openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).