Closed trifud closed 7 years ago
Thanks for the clear and precise information. I am curious where the file came from however.
All I can see is a FASTQ file for this run (https://www.ebi.ac.uk/ena/data/view/SRR2034775)
How did you align it? Do you know for sure the BAM is actually valid? Often corrupted files show up as indexing problems because that is the first command ran. So does "samtools view" also show a problem (redirected to /dev/null for sanity)? If so, how about an independent implementation, such as picard?
If it can be viewed but not indexed, then it is a far more intriguing problem.
I generated the BAM like this:
bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12
I think that the BAM is corrupt since it does not sort with Picard Tools either:
java -Xmx4g -jar /opt/picard-tools/latest/picard.jar BuildBamIndex I=SRR2034775.bam
[Mon Oct 23 20:01:40 CEST 2017] picard.sam.BuildBamIndex INPUT=SRR2034775.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Oct 23 20:01:40 CEST 2017] Executing as atanas@titanium on Linux 4.13.5-200.fc26.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Picard version: 2.9.2-SNAPSHOT
[Mon Oct 23 20:19:48 CEST 2017] picard.sam.BuildBamIndex done. Elapsed time: 18.14 minutes.
Runtime.totalMemory()=984612864
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: Unrecognized tag type:
at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:351)
at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:313)
at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:293)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2004)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:795)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:781)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:751)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:305)
at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:289)
at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:147)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Would it be that you didn't pipe correctly to samtools sort?
I think your command
bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12
should be changed to:
bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools view -bS - | samtools sort -m 3G -o SRR2034775.bam -@12 -
Note the final dash - telling samtools sort it has to read from stdin.
Also, note that your command is longer than necessary:
bwa mem -t 12 /opt/bwa/resources/hg19.fa SRR2034775.fastq | samtools sort -m 3G -o SRR2034775.bam -@12 -
would do the same thing
I don't think that the command is wrong because it is a part of a script that I put together and I have processed hundreds (if not thousands) of other files without any problem:
#!/bin/bash
PROJECT="kennwick"
for i in `ls | grep fastq.gz | sed 's/.fastq.gz//g'`; do
DATE=`date`
echo "${DATE}: Processing $i.fastq"
gunzip $i.fastq.gz
bwa mem -t 12 /opt/bwa/resources/hg19.fa $i.fastq | samtools view -bS - | samtools sort -m 3G -o /trunk/adna/${PROJECT}/bam/$i.bam -@12 && rm $i.fastq
(samtools index /trunk/adna/${PROJECT}/bam/$i.bam && samtools view /trunk/adna/${PROJECT}/bam/$i.bam Y -h > /scratch/${PROJECT}/chry/$i.bam)&
done
I will try with the shorter pipeline.
I will process the file again but it is going to take about eight hours on my system. Is there any way to extract the Y-chromosome data faster?
I stand corrected - it appears to work fine without specifying the -
. My apologies.
Thanks for getting back to us, and indeed I don't think this is an index bug so it's right to close to ticket.
I'm curious as to what corrupted the BAM though. Is it a "convenient" looking size (eg a multiple of 512, 1024 or similar) indicating possible file truncation? Does it end in a BAM EOF block?
$ tail -28c 9827_2#49.bam|hd
00000000 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 |............BC..|
00000010 1b 00 03 00 00 00 00 00 00 00 00 00 |............|
0000001c
$ tail -28c 9827_2#49.bam|zcat
[no output]
Those last 28 bytes are the EOF block, representing an empty deflate stream.
One possibility is running out of temporary disk space too. As sort outputs chunks which are then merged together, you need at least 2x (maybe 2.5 to 3x) storage than the final BAM takes up. In theory samtools will spot running out of disk space, but I'm not sure this has always been the case!
As for extracting just one chromosome, this is trivial once sorted and indexed (ironic). Prior to that you're best bet is a grep or awk script, but the slow bit is the alignment I expect and you didn't cache that output.
Note you can simplify your command line too. Samtools subcommands now accept any format, not just BAM. So you could use bwa ... | samtools sort
without an intervening SAM to BAM conversion step.
This is what I get:
$ ls -l SRR2034775.bam
-rw-rw-r--. 1 atanas atanas 56590675940 21 okt 11.25 SRR2034775.bam
$ tail -28c SRR2034775.bam | hexdump -C
00000000 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 |............BC..|
00000010 1b 00 03 00 00 00 00 00 00 00 00 00 |............|
0000001c
$ tail -28c SRR2034775.bam | zcat
[no output]
Typically, when samtools runs out of disk space, the process just hangs. The CPU does not work but memory is allocated and samtools does neither quit, nor continue when disk space is freed. I have two physicals disks with a lot of free space (terabytes).
I will align the file from scratch in the coming days and give it another try.
Thank you for the shortened command!
Hello, I have been using samtools for ages and it has worked for the most time. However, today I experienced the following issue:
The file is 53 GB large. I have worked with larger files before. The issue is reproducible. I have samtools 1.6:
System info: