Closed Hongjie-Li closed 4 years ago
I also encountered a related error which raises 'unknown type error "0"' during htseq-count for a small fraction of my files. After communicating with @Hongjie-Li and @iosonofabio , this appears to be an issue with the way pysam of htslib (which htseq-count depends on) parses the bam files. @Hongjie-Li and I separately found fixes for when this issue was raised.
My solution is to first convert STAR bam output as sam using samtools, and then run htseq-count on the sam file. One can do this in stream as:
samtools view Aligned.out.bam | htseq-count -f sam - genome_annotation.gtf > htseq.tab
I will raise this issue to htslib, and hopefully this provides a temporary fix for people who encounter a similar issue.
I'm trying to fix issues for a new release, do you guys have a bamfile that I could use for testing? If so please share a google drive link here or via email
Hi Fabio,
Hope you are doing well. Here is the link for the problematic Bam file.
https://drive.google.com/open?id=18BzYTLdmaKp83l02WtuZ7HjTJslGN5T3 https://drive.google.com/open?id=18BzYTLdmaKp83l02WtuZ7HjTJslGN5T3
This is how I ended up solving the problem.
After filtering the BAM file with a samtools command below before htseq-count, the problematic sequences are removed from the BAM file.
samtools view -f 0x2 -b Aligned.out.bam > Aligned.out.pe.bam htseq-count ….
Let us know if you have an update later.
Take care and stay safe! Hongjie
On Apr 5, 2020, at 3:40 AM, Fabio Zanini notifications@github.com wrote:
I'm trying to fix issues for a new release, do you guys have a bamfile that I could use for testing? If so please share a google drive link here or via email
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/67#issuecomment-609395462, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKWTKKHKRIQWMGRSTDICW4TRLBNZNANCNFSM4GC6BK3A.
Hi Hongjie,
I clicked on the link and is says I need permission. Please enable viewing by anyone! Thank you, Fabio
Hi Fabio,
Sorry for the issue, I just edit the sharing option. Let me know if you can get it or not. Take are!
Best, Hongjie
https://drive.google.com/file/d/18BzYTLdmaKp83l02WtuZ7HjTJslGN5T3/view?usp=sharing https://drive.google.com/file/d/18BzYTLdmaKp83l02WtuZ7HjTJslGN5T3/view?usp=sharing
On Apr 8, 2020, at 5:59 PM, Fabio Zanini notifications@github.com wrote:
Hi Hongjie,
I clicked on the link and is says I need permission. Please enable viewing by anyone! Thank you, Fabio
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/67#issuecomment-611269734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKWTKKAQ3ZOJQMGMCLWCVO3RLUMW5ANCNFSM4GC6BK3A.
Thank you, yes I took it down. pysam
says the problematic reads are corrupted in your file:
A00111:215:HC7Y7DSXX:1:2672:3287:26882 409 ERCC-00171 504 0 2M261639N98M * 0 0 AAATGGCCTCACCGTTGACCCGCTTTATGTCGCTGAACCTGCTGCTGCTGGGTGAGTCGATTATCCTGGGGAGTGGAGAAGCTAAGCCACAGGCACCCGA FF:F::,FFFFFFFFFFFFF:F,FF:,FFFF,FF,FF,FFF,FFFF,F:FFFFFFFF:FFF:FFFFFF,,FFFFFFFF:F,FFF:FFFFF,FFFFF:FFF NH:i:5 HI:i:5 AS:i:96 NM:i:1 MD:Z:26C73
[E::sam_format1] Corrupted aux data for read A00111:215:HC7Y7DSXX:1:2672:3287:26882
The first line is a good read, the second line is a problematic read.
Not even samtools can parse that file properly:
$ samtools view -h -o Aligned_not_proper_subsample2.out.sam Aligned_not_proper_subsample.out.bam
[E::sam_format1_append] Corrupted aux data for read A00111:215:HC7Y7DSXX:1:2672:3287:26882
Bottomline: your file is corrupted and/or there is a bug in a very low-level library such as HTSlib. This might or might not be due to an incorrectly set MD
tag by STAR
aligner. I advise you to raise the issue with HTSlib
or samtools
and see what they say. Feel free to mention me.
Of course, if you exclude all inproper reads - which happen by chance to be a strict (and much larger) superset of the corrupted ones - you also fix the corruption issue.
Closing as this is not an HTSeq
issue but upstream.
Thanks for figuring you this out, Fabio! You are the best! Hongjie
On Apr 10, 2020, at 6:44 PM, Fabio Zanini notifications@github.com wrote:
Thank you, yes I took it down. pysam says the problematic reads are corrupted in your file:
A00111:215:HC7Y7DSXX:1:2672:3287:26882 409 ERCC-00171 504 0 2M261639N98M * 0 0 AAATGGCCTCACCGTTGACCCGCTTTATGTCGCTGAACCTGCTGCTGCTGGGTGAGTCGATTATCCTGGGGAGTGGAGAAGCTAAGCCACAGGCACCCGA FF:F::,FFFFFFFFFFFFF:F,FF:,FFFF,FF,FF,FFF,FFFF,F:FFFFFFFF:FFF:FFFFFF,,FFFFFFFF:F,FFF:FFFFF,FFFFF:FFF NH:i:5 HI:i:5 AS:i:96 NM:i:1 MD:Z:26C73
[E::sam_format1] Corrupted aux data for read A00111:215:HC7Y7DSXX:1:2672:3287:26882 The first line is a good read, the second line is a problematic read.
Not even samtools can parse that file properly:
$ samtools view -h -o Aligned_not_proper_subsample2.out.sam Aligned_not_proper_subsample.out.bam [E::sam_format1_append] Corrupted aux data for read A00111:215:HC7Y7DSXX:1:2672:3287:26882 Bottomline: your file is corrupted and/or there is a bug in a very low-level library such as HTSlib. This might or might not be due to an incorrectly set MD tag by STAR aligner. I advise you to raise the issue with HTSlib or samtools and see what they say. Feel free to mention me.
Of course, if you exclude all inproper reads - which happen by chance to be a strict (and much larger) superset of the corrupted ones - you also fix the corruption issue.
Closing as this is not an HTSeq issue but upstream.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/67#issuecomment-612293629, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKWTKKFHEYPYK3526TI6HGLRL7DOTANCNFSM4GC6BK3A.
I have 12 Drosophila samples, and was trying to align sequencing data to the fly genome with STAR and to get the expression matrix using HTseq. All samples went through smoothly, except 1 in the middle, giving errors during HTseq.
Here are details: First, I ran the alignment using STAR, and all 12 samples seemed fine, the Uniquely Mapped Rates ranging 75%-85%. Then, I ran HTseq with following settings (I used this setting several time without any problems). Now, it is strange that all samples generate proper htseqout.tab, except one in the middle. The BAM file for this sample seem normal, not the largest or smallest; the Log.final.out from STAR also seems good to me. I have tried several adjustment and have re-run for a few times, but it always gave an similar error. Please help out. Thanks!
HTseq setting: htseq-count -f bam -r pos -s no -m intersection-strict --max-reads-in-buffer=90000000 /scratch/users/hongjie/scRNAseq_data/bulkRNAseq_PN_36h_VT/alignment/$sample/Aligned.sortedByCoord.out.bam /home/groups/lluo/genomes/drosophila_ERCC_gfp/annotation/dmel-all-r6.10-ERCC-Transgenes.gtf > /scratch/users/hongjie/scRNAseq_data/bulkRNAseq_PN_36h_VT/alignment/$sample/htseqout.tab)
Error: 116600000 SAM alignment record pairs processed. 116700000 SAM alignment record pairs processed. 116800000 SAM alignment record pairs processed. Error occured when processing SAM input (record #234694162 in file /scratch/users/hongjie/scRNAseq_data/bulkRNAseq_PN_36h_VT/alignment/PN_VT_four_age_G24/Aligned.sortedByCoord.out.bam): "unknown type '50'" [Exception type: KeyError, raised in libcalignedsegment.pyx:2534]