samtools / samtools

Tools (written in C using htslib) for manipulating next-generation sequencing data
http://htslib.org/
Other
1.62k stars 580 forks source link

samtools index fails #730

Closed trifud closed 7 years ago

trifud commented 7 years ago

Hello, I have been using samtools for ages and it has worked for the most time. However, today I experienced the following issue:

samtools index SRR2034775.bam 
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
samtools index: failed to create index for "SRR2034775.bam"

The file is 53 GB large. I have worked with larger files before. The issue is reproducible. I have samtools 1.6:

git log -1
commit dd8cab51e650882ffd6bd0b0c12837642b48bfcf (HEAD -> develop, origin/develop, origin/HEAD)
Merge: c944125 f4dc22a
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Sep 28 14:29:33 2017 +0100

System info:

lsb_release -d
Description:    Fedora release 26 (Twenty Six)

uname -r
4.13.5-200.fc26.x86_64

grep "model name" /proc/cpuinfo | uniq
model name  : AMD Ryzen 5 1600X Six-Core Processor

free -g
              total        used        free      shared  buff/cache   available
Mem:             62          16           0           0          45          45
Swap:            29           0          29
jkbonfield commented 7 years ago

Thanks for the clear and precise information. I am curious where the file came from however.

All I can see is a FASTQ file for this run (https://www.ebi.ac.uk/ena/data/view/SRR2034775)

How did you align it? Do you know for sure the BAM is actually valid? Often corrupted files show up as indexing problems because that is the first command ran. So does "samtools view" also show a problem (redirected to /dev/null for sanity)? If so, how about an independent implementation, such as picard?

If it can be viewed but not indexed, then it is a far more intriguing problem.

trifud commented 7 years ago

I generated the BAM like this:

bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12

I think that the BAM is corrupt since it does not sort with Picard Tools either:

java -Xmx4g -jar /opt/picard-tools/latest/picard.jar BuildBamIndex I=SRR2034775.bam 
[Mon Oct 23 20:01:40 CEST 2017] picard.sam.BuildBamIndex INPUT=SRR2034775.bam    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Oct 23 20:01:40 CEST 2017] Executing as atanas@titanium on Linux 4.13.5-200.fc26.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Picard version: 2.9.2-SNAPSHOT
[Mon Oct 23 20:19:48 CEST 2017] picard.sam.BuildBamIndex done. Elapsed time: 18.14 minutes.
Runtime.totalMemory()=984612864
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: Unrecognized tag type: 
    at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:351)
    at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
    at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:313)
    at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:293)
    at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2004)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:795)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:781)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:751)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:305)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:289)
    at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:147)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
wdecoster commented 7 years ago

Would it be that you didn't pipe correctly to samtools sort?

I think your command bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12

should be changed to: bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12 -

Note the final dash - telling samtools sort it has to read from stdin.

Also, note that your command is longer than necessary: bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools sort -m 3G -o SRR2034775.bam -@12 - would do the same thing

trifud commented 7 years ago

I don't think that the command is wrong because it is a part of a script that I put together and I have processed hundreds (if not thousands) of other files without any problem:


#!/bin/bash 

PROJECT="kennwick"

for i in `ls | grep fastq.gz | sed 's/.fastq.gz//g'`; do
    DATE=`date`
    echo "${DATE}: Processing $i.fastq"

    gunzip $i.fastq.gz
    bwa mem -t 12 /opt/bwa/resources/hg19.fa $i.fastq | samtools view -bS - | samtools sort -m 3G -o /trunk/adna/${PROJECT}/bam/$i.bam -@12 && rm $i.fastq
    (samtools index /trunk/adna/${PROJECT}/bam/$i.bam && samtools view /trunk/adna/${PROJECT}/bam/$i.bam Y -h > /scratch/${PROJECT}/chry/$i.bam)&
done

I will try with the shorter pipeline.

I will process the file again but it is going to take about eight hours on my system. Is there any way to extract the Y-chromosome data faster?

wdecoster commented 7 years ago

I stand corrected - it appears to work fine without specifying the -. My apologies.

jkbonfield commented 7 years ago

Thanks for getting back to us, and indeed I don't think this is an index bug so it's right to close to ticket.

I'm curious as to what corrupted the BAM though. Is it a "convenient" looking size (eg a multiple of 512, 1024 or similar) indicating possible file truncation? Does it end in a BAM EOF block?

$ tail -28c 9827_2#49.bam|hd
00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  1b 00 03 00 00 00 00 00  00 00 00 00              |............|
0000001c
$ tail -28c 9827_2#49.bam|zcat
[no output]

Those last 28 bytes are the EOF block, representing an empty deflate stream.

One possibility is running out of temporary disk space too. As sort outputs chunks which are then merged together, you need at least 2x (maybe 2.5 to 3x) storage than the final BAM takes up. In theory samtools will spot running out of disk space, but I'm not sure this has always been the case!

As for extracting just one chromosome, this is trivial once sorted and indexed (ironic). Prior to that you're best bet is a grep or awk script, but the slow bit is the alignment I expect and you didn't cache that output.

Note you can simplify your command line too. Samtools subcommands now accept any format, not just BAM. So you could use bwa ... | samtools sort without an intervening SAM to BAM conversion step.

trifud commented 7 years ago

This is what I get:

$ ls -l SRR2034775.bam
-rw-rw-r--. 1 atanas atanas 56590675940 21 okt 11.25 SRR2034775.bam

$ tail -28c SRR2034775.bam | hexdump -C
00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  1b 00 03 00 00 00 00 00  00 00 00 00              |............|
0000001c

$ tail -28c SRR2034775.bam | zcat
[no output]

Typically, when samtools runs out of disk space, the process just hangs. The CPU does not work but memory is allocated and samtools does neither quit, nor continue when disk space is freed. I have two physicals disks with a lot of free space (terabytes).

I will align the file from scratch in the coming days and give it another try.

Thank you for the shortened command!