Closed (murphycj closed 7 years ago)
I'm going to see if I can patch this myself. But what was the reason (if any) for not including the sequences of the non-coding RNAs?
Hey @murphycj, do you have any idea why Ensembl doesn't include ENSMUST00000141797 in the cdna.all FASTA? There seem to be some other non-coding biotypes in there, but I guess it's not exhaustive?
Looking through the mouse FASTA, the only non-coding transcripts I'm seeing are actually polymorphic regions and pseudogenes:
18 transcript_biotype:IG_C_gene
1 transcript_biotype:IG_C_pseudogene
19 transcript_biotype:IG_D_gene
3 transcript_biotype:IG_D_pseudogene
14 transcript_biotype:IG_J_gene
4 transcript_biotype:IG_LV_gene
301 transcript_biotype:IG_V_gene
155 transcript_biotype:IG_V_pseudogene
2 transcript_biotype:IG_pseudogene
2 transcript_biotype:TEC
10 transcript_biotype:TR_C_gene
4 transcript_biotype:TR_D_gene
70 transcript_biotype:TR_J_gene
10 transcript_biotype:TR_J_pseudogene
194 transcript_biotype:TR_V_gene
34 transcript_biotype:TR_V_pseudogene
3 transcript_biotype:nonsense_mediated_decay
56 transcript_biotype:polymorphic_pseudogene
6904 transcript_biotype:processed_pseudogene
546 transcript_biotype:processed_transcript
109 transcript_biotype:pseudogene
27 transcript_biotype:retained_intron
184 transcript_biotype:transcribed_processed_pseudogene
5 transcript_biotype:transcribed_unitary_pseudogene
181 transcript_biotype:transcribed_unprocessed_pseudogene
20 transcript_biotype:unitary_pseudogene
2372 transcript_biotype:unprocessed_pseudogene
Weirdly the human transcript FASTA has 1 miRNA entry:
gzcat Homo_sapiens.GRCh38.cdna.all.fa.gz | grep '>' | grep -v protein_coding | cut -f 6 -d' ' | sort | uniq -c
28 transcript_biotype:IG_C_gene
11 transcript_biotype:IG_C_pseudogene
64 transcript_biotype:IG_D_gene
24 transcript_biotype:IG_J_gene
6 transcript_biotype:IG_J_pseudogene
228 transcript_biotype:IG_V_gene
295 transcript_biotype:IG_V_pseudogene
1 transcript_biotype:IG_pseudogene
4 transcript_biotype:TEC
8 transcript_biotype:TR_C_gene
5 transcript_biotype:TR_D_gene
93 transcript_biotype:TR_J_gene
4 transcript_biotype:TR_J_pseudogene
161 transcript_biotype:TR_V_gene
43 transcript_biotype:TR_V_pseudogene
1 transcript_biotype:miRNA
7 transcript_biotype:nonsense_mediated_decay
95 transcript_biotype:polymorphic_pseudogene
10817 transcript_biotype:processed_pseudogene
2594 transcript_biotype:processed_transcript
25 transcript_biotype:pseudogene
249 transcript_biotype:retained_intron
492 transcript_biotype:transcribed_processed_pseudogene
62 transcript_biotype:transcribed_unitary_pseudogene
863 transcript_biotype:transcribed_unprocessed_pseudogene
159 transcript_biotype:unitary_pseudogene
3298 transcript_biotype:unprocessed_pseudogene
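For anyone without gzcat, here is a rough Python sketch of the same tally. It assumes the header layout the pipeline above relies on, i.e. space-separated fields with one of them starting with `transcript_biotype:`. The function names are made up for illustration and are not part of any library.

```python
import gzip
from collections import Counter


def count_noncoding_biotypes(header_lines):
    """Tally transcript_biotype values from FASTA header lines.

    Mirrors the shell pipeline: any header that mentions "protein_coding"
    in ANY field is skipped, exactly like `grep -v protein_coding`.
    """
    counts = Counter()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if "protein_coding" in line:
            continue  # same effect as grep -v protein_coding
        for field in line.split():
            if field.startswith("transcript_biotype:"):
                counts[field] += 1  # keep the "transcript_biotype:" prefix, as cut does
                break
    return counts


def count_in_gzipped_fasta(path):
    """Convenience wrapper: open a .fa.gz file and tally its headers."""
    with gzip.open(path, "rt") as handle:
        return count_noncoding_biotypes(handle)
```

One thing to keep in mind when reading the numbers: because the filter matches "protein_coding" anywhere in the header, non-coding transcripts that belong to a protein-coding gene (header contains gene_biotype:protein_coding) are dropped too, so biotypes such as retained_intron are probably undercounted in both lists.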
Not sure why they're separate files, but I asked the Ensembl dev mailing list, so we'll see what they say (if they respond).
Thanks for asking @murphycj!
@iskandr @tavinathanson, someone posted an answer to the question.
Latest PR closes this?
Yes
For example, transcript ENSMUST00000141797 of GRCm38 is non-coding, so there is no sequence attribute. However, the sequence of the transcript is available in a FASTA file. See the ncRNA column here: http://useast.ensembl.org/info/data/ftp/index.html
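To make the idea concrete, here is a minimal sketch of how sequences from the cDNA and ncRNA FASTA files could be merged into one lookup keyed by transcript ID. This is only an illustration, not how the library actually loads sequences: the function names are hypothetical, and it assumes each header starts with the transcript ID, optionally followed by a ".version" suffix.

```python
import gzip


def read_fasta(lines):
    """Parse FASTA lines into {transcript_id_without_version: sequence}."""
    sequences = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            # First token after ">" is the ID; drop a trailing ".N" version.
            current_id = line[1:].split()[0].split(".")[0]
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


def load_transcript_sequences(paths):
    """Merge several gzipped FASTA files (e.g. cDNA + ncRNA) into one dict."""
    merged = {}
    for path in paths:
        with gzip.open(path, "rt") as handle:
            merged.update(read_fasta(handle))
    return merged
```

With something like this, a lookup for ENSMUST00000141797 would succeed as long as the ncRNA file is among the paths passed to `load_transcript_sequences`.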