markziemann / dee2

Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
http://dee2.io
GNU General Public License v3.0

Number of mouse transcripts in annotation #49

Open apredeus opened 5 years ago

apredeus commented 5 years ago

Hello,

I was wondering which annotation version you use to process mouse experiments with kallisto. The Ensembl 90 annotation has 131,195 unique transcripts; however, the cDNA file you've used contains only 109,282. Could you tell me why that is, and why some of the transcripts were dropped?

Thank you!

markziemann commented 5 years ago

Hi @apredeus, I noticed this too. It is an inconsistency between the Ensembl GTF and the cDNA file. For kallisto mapping, DEE2 uses the cDNA file.

    $ wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
    $ zgrep -c '>' Mus_musculus.GRCm38.cdna.all.fa.gz
    109282

I'm not sure about the reasons behind the discrepancy between the two files.
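To put the two numbers side by side, the same kind of count can be run on both files. This is a minimal sketch, not the DEE2 pipeline: the function names are made up for illustration, and it assumes the GTF is gzipped, tab-separated, and carries `transcript` feature lines with a `transcript_id "..."` attribute.

```shell
# Count FASTA records in a gzipped file (equivalent to: zgrep -c '>').
count_fasta_records() {
    gzip -cd "$1" | grep -c '^>'
}

# Count unique transcript_id values on "transcript" feature lines of a gzipped GTF.
# Assumes column 3 holds the feature type and column 9 the attributes.
count_gtf_transcripts() {
    gzip -cd "$1" \
      | awk -F'\t' '$3 == "transcript"' \
      | grep -o 'transcript_id "[^"]*"' \
      | sort -u \
      | awk 'END { print NR }'
}

# Example usage (the GTF file name is an assumption about the release-90 layout):
#   count_fasta_records   Mus_musculus.GRCm38.cdna.all.fa.gz
#   count_gtf_transcripts Mus_musculus.GRCm38.90.gtf.gz
```

Counting unique IDs rather than lines keeps the GTF figure comparable to the FASTA record count even if a transcript ID appears on more than one line.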

apredeus commented 5 years ago

Hello @markziemann,

I contacted Ensembl for clarification. Apparently Ensembl splits its annotation across several files, and the cDNA file is meant to include mostly protein-coding transcripts. On closer examination that doesn't quite hold either: the cDNA file contains protein-coding genes plus all possible types of pseudogenes. Here's the breakdown of the "gene_type" field from the master GTF for what's actually included:

     22 IG_C_gene
      1 IG_C_pseudogene
     20 IG_D_gene
      3 IG_D_pseudogene
     18 IG_J_gene
      4 IG_LV_gene
      2 IG_pseudogene
    306 IG_V_gene
    155 IG_V_pseudogene
    142 polymorphic_pseudogene
   8616 processed_pseudogene
  94937 protein_coding
     74 pseudogene
    499 transcribed_processed_pseudogene
     48 transcribed_unitary_pseudogene
    655 transcribed_unprocessed_pseudogene
     11 TR_C_gene
      5 TR_D_gene
     76 TR_J_gene
     10 TR_J_pseudogene
    194 TR_V_gene
     34 TR_V_pseudogene
     18 unitary_pseudogene
   2549 unprocessed_pseudogene

What's missing is all the lincRNAs, antisense transcripts, and a bunch of other small and miscellaneous RNA types; these are aggregated in a separate file at ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/ncrna/.
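A similar breakdown can be approximated directly from the FASTA headers instead of the GTF. The sketch below assumes each header carries space-separated `key:value` fields such as `gene_biotype:protein_coding` and `transcript_biotype:...`, which is how Ensembl cDNA and ncRNA headers are usually laid out; the function name is hypothetical.

```shell
# Tally one key:value header field across a gzipped Ensembl-style FASTA.
# $1 = gzipped FASTA, $2 = field name (defaults to gene_biotype).
tally_header_field() {
    local fasta_gz=$1 field=${2:-gene_biotype}
    gzip -cd "$fasta_gz" \
      | grep '^>' \
      | grep -o " ${field}:[^ ]*" \
      | cut -d: -f2 \
      | sort \
      | uniq -c
}

# Example usage:
#   tally_header_field Mus_musculus.GRCm38.cdna.all.fa.gz
#   tally_header_field Mus_musculus.GRCm38.cdna.all.fa.gz transcript_biotype
```

The output is in `uniq -c` form (count, then type), like the table above. Counts taken from the FASTA need not match a GTF-derived table exactly, since the two files do not cover the same set of transcripts.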

Additionally, and quite annoyingly, the cDNA file includes entries found exclusively on patches and alt-contigs. People have discussed this for quite a while, and (I think) the overall consensus is that it's best to use the "primary" version of the human/mouse assembly, together with the matching annotation.
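One way to drop the patch and alt-contig entries is to filter on the location field in each FASTA header. This is only a sketch under two assumptions that should be checked against the actual files: that the third header field looks like `chromosome:GRCm38:14:54122226:54122241:1`, and that patch and alt-contig sequence names start with `CHR_` (for example `CHR_MG153_PATCH`). The function name is hypothetical.

```shell
# Print FASTA records from one or more gzipped Ensembl-style files,
# skipping records whose sequence region name starts with "CHR_".
# Records on ordinary chromosomes and unplaced scaffolds are kept.
drop_patch_records() {
    gzip -cd "$@" | awk '
        /^>/ {
            # $3 is the location field: type:assembly:name:start:end:strand
            split($3, loc, ":")
            keep = (loc[3] !~ /^CHR_/)
        }
        keep    # print header and sequence lines of kept records only
    '
}

# Example usage:
#   drop_patch_records Mus_musculus.GRCm38.cdna.all.fa.gz > cdna.primary.fa
```

Because the function accepts several input files, more than one FASTA can be filtered in a single pass and written to one output.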

markziemann commented 5 years ago

Thanks for investigating this, Alex. In the next version of DEE we would like to include lincRNAs as well. So is it safe to concatenate the ncRNA.fa and cDNA.fa, then remove any contigs not on the primary assembly?

apredeus commented 5 years ago

From my previous experience and discussions with other RNA-Seq bioinformaticians, Gencode seems a bit better in terms of consistency and curation, while having the benefit of the same gene/transcript IDs as Ensembl. So I think it's a good idea to take the latest Gencode annotation for both human and mouse.

What we would usually do is take the so-called primary version of the genome assembly (meaning reference chromosomes AND extra scaffolds, but no patches or alt-contigs, since they increase ambiguity and multi-mapping) and the matching primary GTF, and then just use rsem-prepare-reference (from RSEM) to generate transcript sequences that exactly match the genome/GTF.
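The rsem-prepare-reference step could look roughly like the sketch below. It assumes RSEM is installed, that the genome FASTA and GTF are uncompressed, and that the extracted sequences end up in `<prefix>.transcripts.fa`. The function name and the `RSEM_PREPARE` override variable are hypothetical helpers for illustration, not part of RSEM.

```shell
# Build transcript sequences from a genome FASTA plus a matching GTF,
# then report how many transcripts were extracted.
# $1 = GTF, $2 = genome FASTA, $3 = output prefix.
build_transcripts_from_gtf() {
    local gtf=$1 genome_fa=$2 prefix=$3
    # RSEM_PREPARE (hypothetical) lets a different executable be swapped in,
    # e.g. for a dry run; the tool's own log output is sent to stderr.
    "${RSEM_PREPARE:-rsem-prepare-reference}" --gtf "$gtf" "$genome_fa" "$prefix" >&2 \
      && grep -c '^>' "${prefix}.transcripts.fa"
}

# Example usage (file names are placeholders):
#   build_transcripts_from_gtf primary.gtf primary_assembly.fa mouse_ref
#   kallisto index -i mouse_ref.idx mouse_ref.transcripts.fa
```

Generating the sequences from the GTF this way means the transcript set used for quantification is, by construction, the same one described in the annotation.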

Alternatively, Gencode provides pre-extracted transcript sequences as well, but these do not include transcripts located on extra scaffolds. However, in the latest mouse version these account for only 95 out of ~140k, so they can probably be safely ignored :)