simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

Different counts: Full genome vs split genome #109

Open ReemaSingh opened 3 months ago

ReemaSingh commented 3 months ago

Hi,

I wonder if anyone in this community can help me figure out the reason behind the discrepancy in counts. Please see below:

I am using HTSeq-count to generate counts from two different types of alignments. These alignments were generated using two different types of genome indexes built from the same genome using the STAR aligner. See below the detailed process I have used to generate the counts:

Genome used: Mus_musculus.GRCm39.dna_sm.toplevel.fa

Scenario 1: Using full genome to generate genome index and counts:

  1. Here, I have generated a genome index for the entire genome (i.e., Mus_musculus.GRCm39.dna_sm.toplevel.fa)
  2. Then aligned my fastq (e.g., Sample1.fq.gz) file
  3. Generates the counts

Scenario 2: Split the genomes into separate chromosomes.

  1. ./faSplit byname Mus_musculus.GRCm39.dna_sm.toplevel.fa Genome/ - This generates 61 separate fasta sequences
  2. Then, I generated 61 genome indexes for 61 fasta sequences
  3. Aligned fastq (e.g., Sample1.fq.gz) file separately to the 61 genome indexes. This generates 61 different alignments
  4. Combined 61 bam files into one bam file.
  5. Generated the counts

Ideally, the final counts generated from both the above scenarios should be the same. But, in my case, the counts generated from Scenario 2 are much higher than Scenario 1 in a lot of IDs.

I have also tried feature-count (from the Rsubread Bioconductor package) to generate counts from the above two scenarios but again the counts are different.

I would highly appreciate any response.

Many Thanks, Reema,

simon-anders commented 3 months ago

Yes, this is expected. If a read maps to more than one chromosome, it should not be counted because it cannot be unambiguously assigned to one feature. However, if you run the alignment for each chromosome separately, the aligner cannot detect such ambiguity and cannot mark it.

ReemaSingh commented 3 months ago

Thank you, @simon-anders for your response. I understand it now.

Just one more query - What about ambiguity within a single chromosome? Is there a way to fix it?

I am asking this question because I checked the count of one ID (i.e., ENSMUSG00000020870) and I found:

  1. In separate BAM files: ENSMUSG00000020870 has counts (i.e., 1182) only in one file (i.e., Chromosome11). And with the remaining chromosomes, the count is "0". In this case, the reads are mapped only with Chrosome 11 and not with other chromosomes.
  2. In integrated BAM (all split chromosome BAM combined): ENSMUSG00000020870 count is 1182

Whereas the original count of this ID using BAM files (generated without splitting the genome) is "35".

simon-anders commented 3 months ago

If you want to get at the bottom of this, you could re-run htseq-count on your separate BAM file with the option --samout. You will get a SAM file where each alignment is annotated with how it is counted. Use grep to find the IDs of the 1182 reads assigned to your feature, pick a few of them at random, and grep them in the integrated BAM file to see whether they are multimappers there.

ReemaSingh commented 3 months ago

Thank you for your response, @simon-anders. I got the SAM output and I am currently comparing them.