simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

Question about backend of default option -t #83

Closed gtollefson closed 4 years ago

gtollefson commented 5 years ago

Hello,

When running htseq-count using default -t option to count reads that fall within exonic regions, does htseq make counts based upon only the first exon or the last exon for each gene that has multiple exons annotated in the GFF3 file. Or does it count all reads that align in either of the exonic regions recorded?

I am running htseq-count using a GFF3 file that has multiple exon records for a group of a genes and would like to use feature lengths to normalize my counts, but am not sure which to use when there are two exon records for one gene.

I've included a screenshot of such a record.

Screen Shot 2019-05-09 at 2 51 10 PM

iosonofabio commented 5 years ago

Usually people take all exons from each gene and sum over them, but you can separate by exon if you want.

Normalizing by feature length is often not as useful as you might think, but not forbidden either. You probably want to get an idea of which isoforms are actually used and divide by something like the average length of a transcript from any of those isoforms.

RNA-Seq can be tough at times...

fburdet commented 5 years ago

Hello,

About the default option -t exon: I am indeed using a GTF from Ensembl, so I always used this option.

Now, I realised that if I do a colSums (in R) of the reads per sample counted by htseq-count, including the "___" stats, I get about twice as many reads as the number of mapped reads. So this means that some reads are counted several times! How does it work? Does it do the sum of all exons of all transcripts, even if redundant? So if a gene has 10 isoforms, and 1 exon is common to the 10, the number counted by htseq-count for that exon is x10 for the "by gene" final output? If so, isn't that a bit dangerous, given that we have no idea if the transcripts are actually expressed, and how much...

Been looking up the web about that, didn't find an answer.

Can you please enlighten me?

Thanks in advance!

simon-anders commented 5 years ago

No, each uniquely mapped read is counted only once.

The discrepancy in the colSum stems from the multi-mappers, i.e., the count for "__alignment_not_unique": this counter counts the number of alignments, not the number of reads. Hence, reads are overcounted there.

So. the column sum over everything should equal the number of alignment lines in your SAM file. The column sum over everything except for "alignment_not_unique" and "not_aligned" should equal the number of SAM lines with unique alignments (i.e., with "NH:i:1").

fburdet commented 5 years ago

Ah yes, thanks for the answer. Makes sense. I could check it works with my data for single end data.

However, for paired end, I get a factor of about 2 (just below) between the mapped reads (log from STAR 2.6.0c) and htseq-count 0.9.1 (sum of all fields except alignment_not_unique, not_aligned is always == 0)

I'm using the Aligned.sortedByCoord bam, which I realize might not be what is requested... but I don't get any error or warning.

Any idea what's happening? Is it counting R1 and R2?

simon-anders commented 5 years ago

hsteq-count counts fragments, not reads. Typically, each fragment will have one R1 and one R2, and htseq-count checks that they both map to the same gene.

fburdet commented 5 years ago

Yes, that's what I understood, but obviously I have a problem as I have nearly twice as many counts from htseq-count as I have mapped reads from STAR. So it seems htseq-count is counting the reads, not fragments, in my case.

What am I doing wrong? It is stranded paired, so I'm using --stranded=reverse on a bam sorted by coordinates.

simon-anders commented 5 years ago

That's odd. Try sorting on read name. Also: any warnings issued by htseq-count?

fburdet commented 5 years ago

Ah yes: Warning: 60939760 reads with missing mate encountered. That's nearly twice the number of fragments (33206810)

So bad sorting?

iosonofabio commented 5 years ago

We have an open issue with bam files sorted by position, which is slightly less obvious to fix than it should be.

Please try sort by name and repeat.

On Fri, Jul 5, 2019, at 08:37, fburdet wrote:

Ah yes: Warning: 60939760 reads with missing mate encountered. That's nearly twice the number of fragments (33206810)

So bad sorting?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/83?email_source=notifications&email_token=AAJFEAA2LYFHVD2BFS6CO7LP55TEVA5CNFSM4HL5CVJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZJ2CCQ#issuecomment-508797194, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJFEACVH66JU34A33V6MXTP55TEVANCNFSM4HL5CVJA.

fburdet commented 5 years ago

Yes, it does work after sorting by name instead of coordinate.

Thanks for your help.

iosonofabio commented 5 years ago

Thank you, that is useful indeed

On Tue, Jul 9, 2019, at 02:29, fburdet wrote:

Yes, it does work after sorting by name instead of coordinate.

Thanks for your help.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/83?email_source=notifications&email_token=AAJFEAA7EPOASTGWB72HP6DP6RLALA5CNFSM4HL5CVJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZPWDEQ#issuecomment-509567378, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJFEADYKBMLVD4HGVATMILP6RLALANCNFSM4HL5CVJA.

iosonofabio commented 4 years ago

Closing as this is related solely to problems with the buffer when sorting by coordinate.