Open apredeus opened 5 years ago
Hi @apredeus, I noticed this too. It is an inconsistency between the Ensembl GTF and the cDNA file. For kallisto mapping, DEE2 uses the cDNA file.
$ wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
$ zgrep -c '>' Mus_musculus.GRCm38.cdna.all.fa.gz
109282
I'm not sure about the reasons behind the discrepancy between the two files.
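One way to narrow this down would be to list the transcript IDs that appear in the GTF but not in the cDNA FASTA. The following is only a rough sketch: the function names are made up for illustration, and it assumes that the GTF stores IDs as `transcript_id "..."` without a version suffix while the FASTA headers start with `>ID.version`.

```shell
# Print unique transcript IDs from the "transcript" rows of a GTF.
# gzip -cdf reads gzip-compressed input and passes plain text through unchanged.
gtf_transcript_ids() {
    gzip -cdf "$1" |
        awk -F'\t' '$3 == "transcript" && match($9, /transcript_id "[^"]+"/) {
            # the match covers: transcript_id "ID" (15 characters before the ID, 1 after)
            print substr($9, RSTART + 15, RLENGTH - 16)
        }' | LC_ALL=C sort -u
}

# Print unique IDs from FASTA headers, dropping the ">" and any ".version" suffix.
# ASSUMPTION: the ID is the first whitespace-separated token of each header.
fasta_transcript_ids() {
    gzip -cdf "$1" |
        awk '/^>/ { id = substr($1, 2); sub(/\.[0-9]+$/, "", id); print id }' |
        LC_ALL=C sort -u
}

# Usage: missing_from_fasta annotation.gtf[.gz] transcripts.fa[.gz]
# Prints the IDs found in the GTF but not in the FASTA.
missing_from_fasta() {
    tmp_gtf=$(mktemp) && tmp_fa=$(mktemp) || return 1
    gtf_transcript_ids "$1" > "$tmp_gtf"
    fasta_transcript_ids "$2" > "$tmp_fa"
    LC_ALL=C comm -23 "$tmp_gtf" "$tmp_fa"
    rm -f "$tmp_gtf" "$tmp_fa"
}
```

Looking up the biotypes of the IDs this prints should show which categories are absent from the cDNA file.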
Hello @markziemann ,
I contacted Ensembl for clarification. Apparently Ensembl splits its annotation into separate files, and the cDNA file is meant to include mostly protein-coding transcripts. On closer examination that doesn't hold true either: the cDNA file contains protein-coding genes plus all possible types of pseudogenes. Here is the breakdown of the "gene_type" field from the master GTF for what's actually included:
22 IG_C_gene
1 IG_C_pseudogene
20 IG_D_gene
3 IG_D_pseudogene
18 IG_J_gene
4 IG_LV_gene
2 IG_pseudogene
306 IG_V_gene
155 IG_V_pseudogene
142 polymorphic_pseudogene
8616 processed_pseudogene
94937 protein_coding
74 pseudogene
499 transcribed_processed_pseudogene
48 transcribed_unitary_pseudogene
655 transcribed_unprocessed_pseudogene
11 TR_C_gene
5 TR_D_gene
76 TR_J_gene
10 TR_J_pseudogene
194 TR_V_gene
34 TR_V_pseudogene
18 unitary_pseudogene
2549 unprocessed_pseudogene
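For anyone who wants to produce a similar tally, here is a minimal sketch. It is not necessarily the command used for the counts above. The function name is hypothetical, and the attribute name is passed in as an argument because it differs between sources: Ensembl GTFs typically use `gene_biotype` / `transcript_biotype`, while Gencode uses `gene_type` / `transcript_type`.

```shell
# Tally the values of one attribute across the "transcript" rows of a GTF.
# Usage: count_gtf_attribute annotation.gtf[.gz] attribute_name
count_gtf_attribute() {
    gzip -cdf "$1" |
        awk -F'\t' -v key="$2" '$3 == "transcript" {
            n = split($9, fields, /; */)
            for (i = 1; i <= n; i++) {
                # each field looks like: key "value"
                if (index(fields[i], key " \"") == 1) {
                    value = substr(fields[i], length(key) + 3)
                    sub(/".*$/, "", value)
                    print value
                }
            }
        }' | LC_ALL=C sort | uniq -c
}
```

Restricting the GTF to the transcript IDs present in the cDNA FASTA before counting would give a per-file breakdown.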
What's missing is all the lincRNAs, antisense, and a bunch of other small and miscellaneous RNA types; these are collected in a separate file at ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/ncrna/.
Additionally, and quite annoyingly, the cDNA file includes entries found exclusively on patches and alt-contigs. People have discussed this for quite a while, and (I think) the overall consensus is that it's best to use the "primary" version of the human/mouse assembly, together with the matching annotation.
Thanks for investigating this Alex. In the next version of DEE we would like to include lincRNAs as well. So is it safe to concatenate the ncRNA.fa and cDNA.fa then remove any contigs not on the primary assembly?
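A rough sketch of what that filtering could look like, leaving aside whether it is the right approach. It relies on two assumptions about the Ensembl FASTA headers that should be checked against the actual files: that the third header field has the form `type:assembly:region:start:end:strand`, and that patch and alt-contig region names start with `CHR_`. The function name and the file names in the usage comment are hypothetical.

```shell
# Drop FASTA records whose header places them on a patch or alt-contig.
# Reads FASTA on stdin and writes the kept records to stdout.
# ASSUMPTION: header field 3 is type:assembly:region:start:end:strand and
# non-primary region names start with "CHR_".
drop_non_primary() {
    awk '/^>/ {
            split($3, loc, ":")
            keep = (loc[3] !~ /^CHR_/)
        }
        keep'
}

# Example with hypothetical file names (gzip -cd concatenates its inputs):
# gzip -cd cdna.all.fa.gz ncrna.fa.gz | drop_non_primary > transcripts.primary.fa
```

Records on unplaced scaffolds are kept here, since only names beginning with `CHR_` are dropped.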
From my previous experience and discussions with other RNA-Seq bioinformaticians, Gencode seemed a bit better in terms of consistency and curation, while having the benefit of the same gene/transcript IDs as Ensembl. So I think it's a good idea to take the latest Gencode annotation for both human and mouse.
What we would usually do is take the so-called primary version of the genome assembly (meaning reference chromosomes AND extra scaffolds, but no patches or alt-contigs, since they increase ambiguity and multi-mapping) and the matching primary GTF, and then just use rsem-prepare-reference (from RSEM) to generate transcript sequences that exactly match the genome/GTF.
Alternatively, Gencode also provides pre-extracted transcript sequences, but these do not include transcripts located on extra scaffolds. However, in the latest mouse version these account for only 95 out of ~140k, so they could probably be safely ignored :)
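As a sketch of that last step: `rsem-prepare-reference` takes the GTF via `--gtf`, followed by the genome FASTA and an output prefix, and writes the extracted sequences to `<prefix>.transcripts.fa`. The wrapper function and the `RSEM_PREPARE` variable below are hypothetical conveniences, not part of RSEM.

```shell
# Build transcript sequences from a genome FASTA plus a matching GTF with RSEM.
# Usage: build_transcriptome genome.fa annotation.gtf output_prefix
# RSEM_PREPARE may point to the rsem-prepare-reference binary
# (hypothetical convenience variable; RSEM itself does not read it).
build_transcriptome() {
    genome=$1 gtf=$2 prefix=$3
    "${RSEM_PREPARE:-rsem-prepare-reference}" --gtf "$gtf" "$genome" "$prefix" &&
        echo "transcripts written to ${prefix}.transcripts.fa"
}

# Example with hypothetical file names:
# build_transcriptome genome.primary.fa annotation.primary.gtf ref/mouse
```

The resulting `.transcripts.fa` can then be used as the input for building a kallisto index.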
Hello,
I was wondering which annotation version you use for processing mouse experiments with Kallisto. The Ensembl 90 annotation has 131,195 unique transcripts; however, the cDNA file you've used contains only 109,282. Could you tell me why that is, and why some of the transcripts were dropped?
Thank you!