simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

Issues with "Mate records missing" with position-sorted bam files #37

Closed mhibberd closed 4 years ago

mhibberd commented 7 years ago

I've installed htseq-count from github on our local computing cluster and am attempting to use htseq-count in conjunction with a bowtie2 with and without deduplication (using picard) to generate count profiles for predicted genes in a metagenomic assembly.

I run bowtie2 to map reads and pipe the results through samtools to generate a position-sorted bam file: bowtie2-build assembly.fna assembly.fna bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort - -o assembly.fna.pos.bam

Using a standard gtf as gene location input for each contig, I run htseq-count: htseq-count -r pos -t CDS -f bam assembly.fna.pos.bam assembly.gtf > assembly.fna.pos.count

...but each time I see a warning suggesting a VERY high number of reads without mates Warning: Mate records missing for 10524978 records; first such record: <SAM_Alignment object: Paired-end read 'NB501065:182:H53TNBGX3:2:12108:1776:2682' aligned to contig-120_2940:[1721,1767)/+>.

When I sort the bam file by query name and change the htseq-count flag to -r name things run fine bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort -n - -o assembly.fna.qname.bam

Any help is appreciated, even if this a stupid question and there's something glaringly wrong with my function calls to bowtie/samtools/htseq-count.

Thanks!

iosonofabio commented 7 years ago

Hi, There is a bug for position sorted files, for now just sort the file by name first and I'll have a look Thx Fabio

On August 17, 2017 3:18:09 PM PDT, mhibberd notifications@github.com wrote:

I've installed htseq-count from github on our local computing cluster and am attempting to use htseq-count in conjunction with a bowtie2 with and without deduplication (using picard) to generate count profiles for predicted genes in a metagenomic assembly.

I run bowtie2 to map reads and pipe the results through samtools to generate a position-sorted bam file: bowtie2-build assembly.fna assembly.fna bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort - -o assembly.fna.pos.bam

Using a standard gtf as gene location input for each contig, I run htseq-count: htseq-count -r pos -t CDS -f bam assembly.fna.pos.bam assembly.gtf > assembly.fna.pos.count

...but each time I see a warning suggesting a VERY high number of reads without mates Warning: Mate records missing for 10524978 records; first such record: <SAM_Alignment object: Paired-end read 'NB501065:182:H53TNBGX3:2:12108:1776:2682' aligned to contig-120_2940:[1721,1767)/+>.

When I sort the bam file by query name and change the htseq-count flag to -r name things run fine bowtie2 -x assembly.fna -1 reads.1.fastq -2 reads.2.fastq -p 8 | samtools view -hb | samtools sort -n - -o assembly.fna.qname.bam

Any help is appreciated, even if this a stupid question and there's something glaringly wrong with my function calls to bowtie/samtools/htseq-count.

Thanks!

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/simon-anders/htseq/issues/37

sushmitasp commented 7 years ago

Hello, I came across the exact same problem. What would you suggest? I went ahead with the name sorted bam file and got results but the problem is that I can't remove duplicates using samtools rmdup/Picard MarkDuplicates in a name sorted bam. Thanks a lot for your help!

iosonofabio commented 6 years ago

The current workaround is to keep both a name sorted and a position sorted, or switch between them.

If any of you could provide a minimal position-sorted file with proper mates that manifests the issue, that would speed up debugging a lot. In absence of that, it may take a while to fix as I have other things on my plate too...

iosonofabio commented 6 years ago

OK guys, I'm digging a bit into this and it's a decent mess.

The good news:

I added a few bells and whistles to HTSeq so now if you call htseq-count with both --secondary-alignments ignore and --supplementary-alignments ignore you should get a pretty consistent behaviour with both name and position sorted files.

The bad news:

Although multimapped reads are not critical for most applications, htseq-count has some issues with multimappers and paired-end reads.

Funny enough, it works better (but slower) on position-sorted files, despite all those warnings. The fact that htseq-count does not throw so many warnings on name-sorted files happens to be a silent "bug" rather than a sign of healthy behaviour. See below for why one needs the quotes: it's arguably not really a bug.

Detailed explanation by example

The core of the question can be exposed by an example: say in a pair read1 is multimapped to 2 places and read2 is multimapped to 3 places in the genome, and say those mappings are uncoupled, i.e. there is no obvious pairing of which of the read1 alignments go together with the read2 alignments. This example may seem extreme, but remembers genomes are messy places, even very "tidy" ones like the human genome.

Now htseq-count tries to score alignment pairs, which does not necessarily mean read pairs. The two things are the same only if both --secondary-alignments ignore and --supplementary-alignments ignore are used: in this case there is no problem (the good news above).

But if one tries to score secondary alignments for this example it is quite dificult: which of the 2 alignments for read 1 should be paired with which of the 3 alignments from read 2? This is where the BAM parsers for name vs position sorted files differ: the position sorted one tries to infer which of the mates go with which, whereas the name sorted parser pretty much takes the first read1 and pairs it with the first read2, etc. So the two parsers will yield different results.

Proposed solution

It is hard to know exactly what to do, but there are several options:

  1. implement some better heuristics for how to pair alignments in case of multimappers. This is an ill-defined problem so it won't ever work perfectly, but we could try to patch as many holes as possible.
  2. default htseq-count to ignore secondary and supplementary alignments. People will get a consistent behaviour by default, if not back-compatible
  3. force ignoring of secondary and supplementary alignments in htseq-count. Although I like this idea since it makes the software fully self-consistent, it may break a number of people's expectations.

So I'm currently leaning towards option 2, changing the default but letting people run into the thornbush if they so wish. Let me know if you have any comments, else I'll go for this choice in the next version of HTSeq, 0.11.0.

BhavanaNayer commented 5 years ago

Hi Fabio, I am still running into an error with htseq on my paired end data, that contains some singletons (generated by trimmomatic post trimming, which were then used in Hisat2 for alignment). For some samples, it manages to run for position-sorted files, but for most samples, the position-sorted bam files don't work. The name sorted-files work for some and don't work for some others - i am not sure why.

Have you found a way to get this to work?

I always keep getting this type of an error:

Error occured when processing SAM input (record #32 in file /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/SortedNAME_CFPAC-1_N1_aligned.bam):
  'pair_alignments' needs a sequence of paired-end alignments
  [Exception type: ValueError, raised in __init__.py:712]

The code I used is python2.7 -m HTSeq.scripts.count -f bam -r name -s no --secondary-alignments ignore --supplementary-alignments ignore /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/SortedNAME_CFPAC-1_N1_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/ChromosomeNamed_Gtf_GRCh38.gtf > /extstor/sudhirlab/Reety/Bhavana/Set3_firstAgrigenome/Counts_editedcode_SortedNAME_CFPAC-1_N1_aligned.txt

Is there some better way to get around this?

Shukla04 commented 5 years ago

Hello I have the same issue of : reads with missing mate encountered. My bam files are sorted by name using command: samtools sort -n -o sorted_uniq_hits/$f3.bam $file

And after sorting they look like (sorted alignment files):

SRR8206438.4.1 83 5 458763 50 101M80N50M = 458666 -328 GGGTAATAGAGTTTCTCAGAAGACAGGAAAAGGGTAAAAAGTACACCGCATTGGTCTTGAGATTTGTCAAAGACCGGATCGCATCCCTTTTGTTGGTTGAGGTGGGTTTTCAAGCAACGGCGTGGGTGTCAGAAGGAAAACAAGTGGGAGG FHHHGHGHIGIIHIIHHIIIIIHIIIIIIHGIHHIIIIIIHHHEGIIHHGIIHIIIIIIIIIIIIIIIIIIHHIIIIIIIHHIIIHGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHFIIIIIIIIIIDDDDD AS:i:-5 XM:i:1 XO:i:0 XG:i:0 MD:Z:150A0 NM:i:1 XS:A:+ NH:i:1 SRR8206438.4.2 163 5 458666 50 150M = 458763 328 TTTCCATTCTCTGCTGGTGAGTTAGAAGGTATAGCAGCATCCGTAAACATGCAGAGCAAAGTGGTGAGAAAACTCTCAAACACCGGTCTCCGCTATTGGGTAATAGAGTTTCTCAGAAGACAGGAAAAGGGTAAAAAGTACACCGCATTG DDDDDIIIIIIIIHHIIGHHHIIIIHIIIIIIIIIIIIIIIHIIIIIIIIIIHIIIIIHIIHHIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIHDHHIIIIHIIIHIIIGIIIIIIHIIIHGHIIGIIIIHHIHIGEDHHIIE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YT:Z:UU NH:i:1 SRR8206438.5.1 99 1 21624037 50 151M = 21624047 161 TTGTTGTTCCGCCATTGCCTTAAGCTTTTCCTCCAAAATGTCTTTCTCAAGACCGTACATTCTGTTCAAAGCATGCATATTCTCAAGGGTCTCGCTCAGCGTTCTACTGTCACTTCCGTCACATCTCAAATGCTGCATTGGACCATGTAAG @BDDDIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIGIIGHGHIGHIIIIIIIIIHIIIIIIIIIIIIIIIIIGIIIIIHHHIIIIIIC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YT:Z:UU NH:i:1 SRR8206438.5.2 147 1 21624047 50 151M = 21624037 -161 GCCATTGCCTTAAGCTTTTCCTCCAAAATGTCTTTCTCAAGACCGTACATTCTGTTCAAAGCATGCATATTCTCAAGGGTCTCGCTCAGCGTTCTACTGTCACTTCCGTCACATCTCAAATGCTGCATTGGACCATGTAAGAATCTGTTCA CHIHHHHHGHEIIIHIIIIHHEDHFHHHIHIIIIIHIIIIIHIIIIIHHFIIIHHIIGIIHIHIIIIHIIIHIIIIIHIIIHIIIIIIIIGIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIGIIIIIIIIIIIIIIIIIIIDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YT:Z:UU NH:i:1

On these name sorted files, I am running htseq-count using command:

htseq-count -f bam -r name -s yes -t gene -i gene_id -m union --secondary-alignments ignore --supplementary-alignments ignore $file $genome/Arabidopsis_thaliana.TAIR10.42.gff3 > htseq_genecount/$f3.txt

I get the same error all the time even when my alignment files are perfectly sorted, htseq is not able to count 90% of the data.

A warning message or Error looks like:

700000 GFF lines processed. 791564 GFF lines processed. Warning: Read SRR8206438.4.1 claims to have an aligned mate which could not be found in an adjacent line. 100000 SAM alignment record pairs processed. 200000 SAM alignment record pairs processed.............

And at the end:

24800000 SAM alignment record pairs processed. 24900000 SAM alignment record pairs processed. 25000000 SAM alignment record pairs processed. Warning: 24289372 reads with missing mate encountered. 25089209 SAM alignment pairs processed.

Although in bam file, SRR8206438.4.1 read has the mate in adjacent line. I still get the error.

Can someone suggest me a solution. How to fix it?

bounlu commented 4 years ago

I have the same problem as of today. Any permanent fix yet?

iosonofabio commented 4 years ago

Il have a go at it tomorrow

On Mon, Mar 23, 2020, at 18:34, Ömer An wrote:

I have the same problem. Any solution yet?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/37#issuecomment-602431464, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEACR2QNEDUJ2IJSQHMLRI4GJJANCNFSM4DXLQU2Q.

iosonofabio commented 4 years ago

This issue has been a mix of old HTSeq bugs, people using corrupted files, and other things all tangled. The official repo is now https://github.com/htseq/htseq. Please open the issue there again with a link to an example (including BAM and GTF files) and I'll have a look. Thank you