samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

References in different order to headers #1271

Open jkbonfield opened 5 years ago

jkbonfield commented 5 years ago

Verify

When picard (which I assume means htsjdk) fails, it does so silently and exits 0.

Subject of the issue

If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.

I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order.

Steps to reproduce

Example to reproduce whole file streaming differences (BAM vs CRAM):

# Create 10 bam and cram files with snippet from first 10 chromosomes
$ for i in `seq 1 10`;do samtools view -@8 -O cram -o /tmp/_$i.cram ~/scratch/data/9827_2#49.cram $i:20000000-25000000;done
$ for i in `seq 1 10`;do samtools view -@8 -O bam -o /tmp/_$i.bam ~/scratch/data/9827_2#49.cram $i:20000000-25000000;done

# Concatenate in random order
$ samtools cat /tmp/_{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/_.bam
$ samtools cat /tmp/_{9,2,1,6,10,4,7,8,3,5}.cram -o /tmp/_.cram

# Count entries
$ samtools view -c /tmp/_.bam
935949
$ samtools view -c /tmp/_.cram
935949

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
935949
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
98188

Example to reproduce htsjdk vs samtools differences in random access:

#  Create input
$ samtools cat /tmp/_{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/_.bam; samtools index /tmp/_.bam
$ samtools cat /tmp/_{1,2,3,4,5,6,7,8,9,10}.bam -o /tmp/_o.bam; samtools index /tmp/_o.bam

# Create region list
$ (samtools view -H /tmp/_.bam;for i in 3 8 2 4;do echo -e "$i\t22000000\t24000000\t+\ttarget_$i";done) > /tmp/_reg.list

# Query header-sorted: OK
$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list  2>/dev/null | egrep -cv '^@'
149025
$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list  2>/dev/null | egrep -cv '^@'
149025

# Query out-of-order references: silent failure
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list  2>/dev/null | egrep -cv '^@'
110551

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list  2>/dev/null | egrep -cv '^@'
110551

# With samtools
$ for i in _.cram _.bam _o.cram _o.bam;do samtools view -@4 -c /tmp/$i $(for i in 3 8 2 4;do echo $i:22000000-24000000;done);done
149025
149025
149025
149025

# Validation against original data file (Picard also matches this, for both bam and cram)
$ samtools view -@4 -c ~/scratch/data/9827_2#49.bam $(for i in 3 8 2 4;do echo $i:22000000-24000000;done)
149025

I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one?

Your environment

openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).

yfarjoun commented 5 years ago

if it isn't in the spec it should be!

On Fri, Feb 1, 2019 at 11:31 AM James Bonfield notifications@github.com wrote:

Verify

When picard (which I assume means htsjdk) fails, it does so silently and exits 0. Subject of the issue

If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.

I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order. Steps to reproduce

Example to reproduce whole file streaming differences (BAM vs CRAM):

Create 10 bam and cram files with snippet from first 10 chromosomes

$ for i in seq 1 10;do samtools view -@8 -O cram -o /tmp/_$i.cram ~/scratch/data/98272#49.cram $i:20000000-25000000;done $ for i in seq 1 10;do samtools view -@8 -O bam -o /tmp/$i.bam ~/scratch/data/9827_2#49.cram $i:20000000-25000000;done

Concatenate in random order

$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam $ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.cram -o /tmp/.cram

Count entries

$ samtools view -c /tmp/.bam 935949 $ samtools view -c /tmp/.cram 935949

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCESEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 935949 $ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 98188

Example to reproduce htsjdk vs samtools differences in random access:

Create input

$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam; samtools index /tmp/.bam $ samtools cat /tmp/{1,2,3,4,5,6,7,8,9,10}.bam -o /tmp/_o.bam; samtools index /tmp/_o.bam

Create region list

$ (samtools view -H /tmp/.bam;for i in 3 8 2 4;do echo -e "$i\t22000000\t24000000\t+\ttarget$i";done) > /tmp/_reg.list

Query header-sorted: OK

$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025 $ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025

Query out-of-order references: silent failure

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551

With samtools

$ for i in .cram .bam _o.cram _o.bam;do samtools view -@4 -c /tmp/$i $(for i in 3 8 2 4;do echo $i:22000000-24000000;done);done 149025 149025 149025 149025

Validation against original data file (Picard also matches this, for both bam and cram)

$ samtools view -@4 -c ~/scratch/data/9827_2#49.bam $(for i in 3 8 2 4;do echo $i:22000000-24000000;done) 149025

I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one? Your environment

openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/1271, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0k7MWRSJ9fSYSzFDhiD1c9wziQY_ks5vJGvWgaJpZM4aepoL .

yfarjoun commented 5 years ago

For coordinate sort, the major sort key is the RNAME field, with order defined by the order of @SQ lines in the header.

On Fri, Feb 1, 2019 at 11:34 AM Yossi Farjoun farjoun@broadinstitute.org wrote:

if it isn't in the spec it should be!

On Fri, Feb 1, 2019 at 11:31 AM James Bonfield notifications@github.com wrote:

Verify

When picard (which I assume means htsjdk) fails, it does so silently and exits 0. Subject of the issue

If we construct a file where the data is position sorted within each chromosome, but the chromosomes themselves are not in the canonical header order, various things do not work.

I am unsure what is meant to work, but there are subtle differences between CRAM and BAM here, plus some things that don't work on either, plus no errors returned to the user. If this is an illegal file structure then we should see an error, but I don't believe it is illegal to do this. I can't find anything in the BAM spec that indicates files must match the SQ header order. Steps to reproduce

Example to reproduce whole file streaming differences (BAM vs CRAM):

Create 10 bam and cram files with snippet from first 10 chromosomes

$ for i in seq 1 10;do samtools view -@8 -O cram -o /tmp/_$i.cram ~/scratch/data/98272#49.cram $i:20000000-25000000;done $ for i in seq 1 10;do samtools view -@8 -O bam -o /tmp/$i.bam ~/scratch/data/9827_2#49.cram $i:20000000-25000000;done

Concatenate in random order

$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam $ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.cram -o /tmp/.cram

Count entries

$ samtools view -c /tmp/.bam 935949 $ samtools view -c /tmp/.cram 935949

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCESEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 935949 $ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@' 98188

Example to reproduce htsjdk vs samtools differences in random access:

Create input

$ samtools cat /tmp/{9,2,1,6,10,4,7,8,3,5}.bam -o /tmp/.bam; samtools index /tmp/.bam $ samtools cat /tmp/{1,2,3,4,5,6,7,8,9,10}.bam -o /tmp/_o.bam; samtools index /tmp/_o.bam

Create region list

$ (samtools view -H /tmp/.bam;for i in 3 8 2 4;do echo -e "$i\t22000000\t24000000\t+\ttarget$i";done) > /tmp/_reg.list

Query header-sorted: OK

$ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025 $ usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_o.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 149025

Query out-of-order references: silent failure

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF INTERVAL_LIST=/tmp/_reg.list 2>/dev/null | egrep -cv '^@' 110551

With samtools

$ for i in .cram .bam _o.cram _o.bam;do samtools view -@4 -c /tmp/$i $(for i in 3 8 2 4;do echo $i:22000000-24000000;done);done 149025 149025 149025 149025

Validation against original data file (Picard also matches this, for both bam and cram)

$ samtools view -@4 -c ~/scratch/data/9827_2#49.bam $(for i in 3 8 2 4;do echo $i:22000000-24000000;done) 149025

I'm not sure where 110551 is coming from. It's 38474 short of the correct value, which is the number of records that should be retrieved from chr 3. Why miss the first one? Your environment

openjdk 10.0.2 2018-07-17 picard-2.18.15 Unknown htsjdk version. I am assuming the bug is in the library, at least in part given BAM vs CRAM differences, but I'm using whatever is bundled with picard. OS: Ubuntu (precise IIRC which machine I was on).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/1271, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0k7MWRSJ9fSYSzFDhiD1c9wziQY_ks5vJGvWgaJpZM4aepoL .

jkbonfield commented 5 years ago

Maybe, but splitting into chromosomes files, operating in parallel and then merging by fast concatenation is not an uncommon strategy. Care must be taken to concatenate in the correct order, but I could imagine a naive "samtools cat my_file*.bam" and not noticing the lack of ordering if tools just carry on working regardless.

Checking the spec I see it now does clarify this:

For coordinate sort, the major sort key is the {\sf RNAME} field, with order defined by the order of {\tt @SQ} lines in the header. The minor sort key is the {\sf POS} field.

In that case please consider this issue to be a bug report for silently failing when it should be noisy.

yfarjoun commented 5 years ago

in that case we should have the discussion in picard, not htsjdk...

but to the point, ViewSam doesn't validate it....Try running your file through ValidateSamFile and I suspect you'll get the error you are looking for.

jkbonfield commented 5 years ago

I forgot to add the link to the other issue where I was prompted to file this:

https://github.com/samtools/hts-specs/issues/376

I still don't know if this is picard or htsjdk as I don't know what check is where. Giving different behaviour on BAM vs CRAM does sound like it's more likely to be an htsjdk check.

Of course maybe this is an issue for samtools too! If this is illegal (which it appears to be) then samtools cat could spot it upfront and fail early. However for whatever reason, someone went to long extents to ensure samtools and htslib works with out of order references. I haven't researched why and whom yet.

yfarjoun commented 5 years ago

well...the problem is that sometimes there's an out of order bam/sam/cram that needs to be fixed...so you don't want htsjdk/htslib to fail outright since then it will be difficult to write a tool that will fix it....

jkbonfield commented 5 years ago

That's a good point. I think we should probably be issuing warnings but not make them fatal so people can fix data (eg sort runs so it'll get back to a known good state).

cmnbroad commented 5 years ago

Added CRAM label due to the first issue above ("whole file streaming differences (BAM vs CRAM)":

$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.bam ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
935949
$ /usr/bin/java -jar ~/work/picard-2.18.15.jar ViewSam INPUT=/tmp/_.cram ALIGNMENT_STATUS=All PF_STATUS=All REFERENCE_SEQUENCE=$HREF 2>/tmp/err | egrep -cv '^@'
98188