Closed mhibberd closed 4 years ago
Hi, There is a bug for position sorted files, for now just sort the file by name first and I'll have a look Thx Fabio
On August 17, 2017 3:18:09 PM PDT, mhibberd notifications@github.com wrote:
I've installed htseq-count from github on our local computing cluster and am attempting to use htseq-count in conjunction with a bowtie2 with and without deduplication (using picard) to generate count profiles for predicted genes in a metagenomic assembly.
I run bowtie2 to map reads and pipe the results through samtools to generate a position-sorted bam file: bowtie2-build assembly.fna assembly.fna bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort - -o assembly.fna.pos.bam
Using a standard gtf as gene location input for each contig, I run htseq-count: htseq-count -r pos -t CDS -f bam assembly.fna.pos.bam assembly.gtf > assembly.fna.pos.count
...but each time I see a warning suggesting a VERY high number of reads without mates Warning: Mate records missing for 10524978 records; first such record: <SAM_Alignment object: Paired-end read 'NB501065:182:H53TNBGX3:2:12108:1776:2682' aligned to contig-120_2940:[1721,1767)/+>.
When I sort the bam file by query name and change the htseq-count flag to -r name things run fine bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort -n - -o assembly.fna.qname.bam
Any help is appreciated, even if this a stupid question and there's something glaringly wrong with my function calls to bowtie/samtools/htseq-count.
Thanks!
-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/simon-anders/htseq/issues/37
Hello, I came across the exact same problem. What would you suggest? I went ahead with the name sorted bam file and got results but the problem is that I can't remove duplicates using samtools rmdup/Picard MarkDuplicates in a name sorted bam. Thanks a lot for your help!
The current workaround is to keep both a name sorted and a position sorted, or switch between them.
If any of you could provide a minimal position-sorted file with proper mates that manifests the issue, that would speed up debugging a lot. In absence of that, it may take a while to fix as I have other things on my plate too...
OK guys, I'm digging a bit into this and it's a decent mess.
I added a few bells and whistles to HTSeq so now if you call htseq-count
with both --secondary-alignments ignore
and --supplementary-alignments ignore
you should get a pretty consistent behaviour with both name and position sorted files.
Although multimapped reads are not critical for most applications, htseq-count
has some issues with multimappers and paired-end reads.
Funny enough, it works better (but slower) on position-sorted files, despite all those warnings. The fact that htseq-count
does not throw so many warnings on name-sorted files happens to be a silent "bug" rather than a sign of healthy behaviour. See below for why one needs the quotes: it's arguably not really a bug.
The core of the question can be exposed by an example: say in a pair read1 is multimapped to 2 places and read2 is multimapped to 3 places in the genome, and say those mappings are uncoupled, i.e. there is no obvious pairing of which of the read1 alignments go together with the read2 alignments. This example may seem extreme, but remembers genomes are messy places, even very "tidy" ones like the human genome.
Now htseq-count
tries to score alignment pairs, which does not necessarily mean read pairs. The two things are the same only if both --secondary-alignments ignore
and --supplementary-alignments ignore
are used: in this case there is no problem (the good news above).
But if one tries to score secondary alignments for this example it is quite dificult: which of the 2 alignments for read 1 should be paired with which of the 3 alignments from read 2? This is where the BAM parsers for name vs position sorted files differ: the position sorted one tries to infer which of the mates go with which, whereas the name sorted parser pretty much takes the first read1 and pairs it with the first read2, etc. So the two parsers will yield different results.
It is hard to know exactly what to do, but there are several options:
htseq-count
to ignore secondary and supplementary alignments. People will get a consistent behaviour by default, if not back-compatiblehtseq-count
. Although I like this idea since it makes the software fully self-consistent, it may break a number of people's expectations.So I'm currently leaning towards option 2, changing the default but letting people run into the thornbush if they so wish. Let me know if you have any comments, else I'll go for this choice in the next version of HTSeq, 0.11.0.
Hi Fabio, I am still running into an error with htseq on my paired end data, that contains some singletons (generated by trimmomatic post trimming, which were then used in Hisat2 for alignment). For some samples, it manages to run for position-sorted files, but for most samples, the position-sorted bam files don't work. The name sorted-files work for some and don't work for some others - i am not sure why.
Have you found a way to get this to work?
I always keep getting this type of an error:
Error occured when processing SAM input (record #32 in file /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/SortedNAME_CFPAC-1_N1_aligned.bam):
'pair_alignments' needs a sequence of paired-end alignments
[Exception type: ValueError, raised in __init__.py:712]
The code I used is
python2.7 -m HTSeq.scripts.count -f bam -r name -s no --secondary-alignments ignore --supplementary-alignments ignore /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/SortedNAME_CFPAC-1_N1_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/ChromosomeNamed_Gtf_GRCh38.gtf > /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/Counts_editedcode_SortedNAME_CFPAC-1_N1_aligned.txt
Is there some better way to get around this?
Hello I have the same issue of : reads with missing mate encountered. My bam files are sorted by name using command: samtools sort -n -o sorted_uniq_hits/$f3.bam $file
And after sorting they look like (sorted alignment files):
SRR8206438.4.1 83 5 458763 50 101M80N50M = 458666 -328 GGGTAATAGAGTTTCTCAGAAGACAGGAAAAGGGTAAAAAGTACACCGCATTGGTCTTGAGATTTGTCAAAGACCGGATCGCATCCCTTTTGTTGGTTGAGGTGGGTTTTCAAGCAACGGCGTGGGTGTCAGAAGGAAAACAAGTGGGAGG FHHHGHGHIGIIHIIHHIIIIIHIIIIIIHGIHHIIIIIIHHHEGIIHHGIIHIIIIIIIIIIIIIIIIIIHHIIIIIIIHHIIIHGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHFIIIIIIIIIIDDDDD AS:i:-5 XM:i:1 XO:i:0 XG:i:0 MD:Z:150A0 NM:i:1 XS:A:+ NH:i:1 SRR8206438.4.2 163 5 458666 50 150M = 458763 328 TTTCCATTCTCTGCTGGTGAGTTAGAAGGTATAGCAGCATCCGTAAACATGCAGAGCAAAGTGGTGAGAAAACTCTCAAACACCGGTCTCCGCTATTGGGTAATAGAGTTTCTCAGAAGACAGGAAAAGGGTAAAAAGTACACCGCATTG DDDDDIIIIIIIIHHIIGHHHIIIIHIIIIIIIIIIIIIIIHIIIIIIIIIIHIIIIIHIIHHIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIHDHHIIIIHIIIHIIIGIIIIIIHIIIHGHIIGIIIIHHIHIGEDHHIIE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YT:Z:UU NH:i:1 SRR8206438.5.1 99 1 21624037 50 151M = 21624047 161 TTGTTGTTCCGCCATTGCCTTAAGCTTTTCCTCCAAAATGTCTTTCTCAAGACCGTACATTCTGTTCAAAGCATGCATATTCTCAAGGGTCTCGCTCAGCGTTCTACTGTCACTTCCGTCACATCTCAAATGCTGCATTGGACCATGTAAG @BDDDIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIGIIGHGHIGHIIIIIIIIIHIIIIIIIIIIIIIIIIIGIIIIIHHHIIIIIIC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YT:Z:UU NH:i:1 SRR8206438.5.2 147 1 21624047 50 151M = 21624037 -161 GCCATTGCCTTAAGCTTTTCCTCCAAAATGTCTTTCTCAAGACCGTACATTCTGTTCAAAGCATGCATATTCTCAAGGGTCTCGCTCAGCGTTCTACTGTCACTTCCGTCACATCTCAAATGCTGCATTGGACCATGTAAGAATCTGTTCA CHIHHHHHGHEIIIHIIIIHHEDHFHHHIHIIIIIHIIIIIHIIIIIHHFIIIHHIIGIIHIHIIIIHIIIHIIIIIHIIIHIIIIIIIIGIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIGIIIIIIIIIIIIIIIIIIIDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YT:Z:UU NH:i:1
On these name sorted files, I am running htseq-count using command:
htseq-count -f bam -r name -s yes -t gene -i gene_id -m union --secondary-alignments ignore --supplementary-alignments ignore $file $genome/Arabidopsis_thaliana.TAIR10.42.gff3 > htseq_genecount/$f3.txt
I get the same error all the time even when my alignment files are perfectly sorted, htseq is not able to count 90% of the data.
A warning message or Error looks like:
700000 GFF lines processed. 791564 GFF lines processed. Warning: Read SRR8206438.4.1 claims to have an aligned mate which could not be found in an adjacent line. 100000 SAM alignment record pairs processed. 200000 SAM alignment record pairs processed.............
And at the end:
24800000 SAM alignment record pairs processed. 24900000 SAM alignment record pairs processed. 25000000 SAM alignment record pairs processed. Warning: 24289372 reads with missing mate encountered. 25089209 SAM alignment pairs processed.
Can someone suggest me a solution. How to fix it?
I have the same problem as of today. Any permanent fix yet?
Il have a go at it tomorrow
On Mon, Mar 23, 2020, at 18:34, Ömer An wrote:
I have the same problem. Any solution yet?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/37#issuecomment-602431464, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEACR2QNEDUJ2IJSQHMLRI4GJJANCNFSM4DXLQU2Q.
This issue has been a mix of old HTSeq bugs, people using corrupted files, and other things all tangled. The official repo is now https://github.com/htseq/htseq. Please open the issue there again with a link to an example (including BAM and GTF files) and I'll have a look. Thank you
I've installed htseq-count from github on our local computing cluster and am attempting to use htseq-count in conjunction with a bowtie2 with and without deduplication (using picard) to generate count profiles for predicted genes in a metagenomic assembly.
I run bowtie2 to map reads and pipe the results through samtools to generate a position-sorted bam file: bowtie2-build assembly.fna assembly.fna bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort - -o assembly.fna.pos.bam
Using a standard gtf as gene location input for each contig, I run htseq-count: htseq-count -r pos -t CDS -f bam assembly.fna.pos.bam assembly.gtf > assembly.fna.pos.count
...but each time I see a warning suggesting a VERY high number of reads without mates Warning: Mate records missing for 10524978 records; first such record: <SAM_Alignment object: Paired-end read 'NB501065:182:H53TNBGX3:2:12108:1776:2682' aligned to contig-120_2940:[1721,1767)/+>.
When I sort the bam file by query name and change the htseq-count flag to -r name things run fine bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort -n - -o assembly.fna.qname.bam
Any help is appreciated, even if this a stupid question and there's something glaringly wrong with my function calls to bowtie/samtools/htseq-count.
Thanks!