openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Add sequence attribute to non-coding transcripts #184

Closed murphycj closed 7 years ago

murphycj commented 7 years ago

For example, transcript ENSMUST00000141797 of GRCm38 is non-coding, so there is no sequence attribute. However, the sequence of the transcript is available in a fasta file. E.g. look under the ncRNA column here: http://useast.ensembl.org/info/data/ftp/index.html
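To illustrate the idea, here is a minimal sketch (not pyensembl's actual implementation) of building a transcript-ID-to-sequence lookup from more than one FASTA file, so that transcripts listed only in the ncRNA file are also covered. The function names `parse_fasta` and `load_transcript_sequences` are hypothetical, and the sketch assumes the first token of each header is the transcript ID, optionally followed by a `.version` suffix.

```python
import gzip


def parse_fasta(lines):
    """Yield (transcript_id, sequence) pairs from an iterable of FASTA lines.

    Assumes the header's first whitespace-separated token is the transcript
    ID; any ".version" suffix is dropped so lookups work with bare IDs.
    """
    current_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if current_id is not None:
                yield current_id, "".join(chunks)
            current_id = line[1:].split()[0].split(".")[0]
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        yield current_id, "".join(chunks)


def load_transcript_sequences(paths):
    """Merge several gzipped FASTA files (e.g. cDNA and ncRNA) into one dict."""
    sequences = {}
    for path in paths:
        with gzip.open(path, "rt") as handle:
            sequences.update(parse_fasta(handle))
    return sequences
```

With the cDNA and ncRNA files downloaded locally, `load_transcript_sequences([cdna_path, ncrna_path]).get("ENSMUST00000141797")` would return the sequence if either file contains that transcript, and `None` otherwise.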

murphycj commented 7 years ago

I am going to see if I can patch this myself. But what was the reason (if any) for not including the sequences of the non-coding RNAs?

iskandr commented 7 years ago

Hey @murphycj, do you have any idea why Ensembl doesn't include ENSMUST00000141797 in the cdna.all FASTA? There seem to be some other non-coding biotypes in there, but I guess it's not exhaustive?

iskandr commented 7 years ago

Looking through the mouse FASTA the only non-coding transcripts I'm seeing are actually polymorphic regions and pseudogenes:

  18 transcript_biotype:IG_C_gene
   1 transcript_biotype:IG_C_pseudogene
  19 transcript_biotype:IG_D_gene
   3 transcript_biotype:IG_D_pseudogene
  14 transcript_biotype:IG_J_gene
   4 transcript_biotype:IG_LV_gene
 301 transcript_biotype:IG_V_gene
 155 transcript_biotype:IG_V_pseudogene
   2 transcript_biotype:IG_pseudogene
   2 transcript_biotype:TEC
  10 transcript_biotype:TR_C_gene
   4 transcript_biotype:TR_D_gene
  70 transcript_biotype:TR_J_gene
  10 transcript_biotype:TR_J_pseudogene
 194 transcript_biotype:TR_V_gene
  34 transcript_biotype:TR_V_pseudogene
   3 transcript_biotype:nonsense_mediated_decay
  56 transcript_biotype:polymorphic_pseudogene
6904 transcript_biotype:processed_pseudogene
 546 transcript_biotype:processed_transcript
 109 transcript_biotype:pseudogene
  27 transcript_biotype:retained_intron
 184 transcript_biotype:transcribed_processed_pseudogene
   5 transcript_biotype:transcribed_unitary_pseudogene
 181 transcript_biotype:transcribed_unprocessed_pseudogene
  20 transcript_biotype:unitary_pseudogene
2372 transcript_biotype:unprocessed_pseudogene
iskandr commented 7 years ago

Weirdly the human transcript FASTA has 1 miRNA entry:

gzcat Homo_sapiens.GRCh38.cdna.all.fa.gz | grep '>' | grep -v protein_coding | cut -f 6 -d' ' | sort | uniq -c
  28 transcript_biotype:IG_C_gene
  11 transcript_biotype:IG_C_pseudogene
  64 transcript_biotype:IG_D_gene
  24 transcript_biotype:IG_J_gene
   6 transcript_biotype:IG_J_pseudogene
 228 transcript_biotype:IG_V_gene
 295 transcript_biotype:IG_V_pseudogene
   1 transcript_biotype:IG_pseudogene
   4 transcript_biotype:TEC
   8 transcript_biotype:TR_C_gene
   5 transcript_biotype:TR_D_gene
  93 transcript_biotype:TR_J_gene
   4 transcript_biotype:TR_J_pseudogene
 161 transcript_biotype:TR_V_gene
  43 transcript_biotype:TR_V_pseudogene
   1 transcript_biotype:miRNA
   7 transcript_biotype:nonsense_mediated_decay
  95 transcript_biotype:polymorphic_pseudogene
10817 transcript_biotype:processed_pseudogene
2594 transcript_biotype:processed_transcript
  25 transcript_biotype:pseudogene
 249 transcript_biotype:retained_intron
 492 transcript_biotype:transcribed_processed_pseudogene
  62 transcript_biotype:transcribed_unitary_pseudogene
 863 transcript_biotype:transcribed_unprocessed_pseudogene
 159 transcript_biotype:unitary_pseudogene
3298 transcript_biotype:unprocessed_pseudogene
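
For anyone without `gzcat`, the shell pipeline above can be approximated in Python. This is a rough sketch, assuming the Ensembl header layout in which the sixth space-separated field is `transcript_biotype:...`; the function names are hypothetical and not part of pyensembl.

```python
import gzip
from collections import Counter


def count_noncoding_biotypes(lines):
    """Count the 6th space-separated header field, skipping protein_coding.

    Roughly mirrors: grep '>' | grep -v protein_coding | cut -f 6 -d' ' | sort | uniq -c
    """
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if "protein_coding" in line:
            continue  # same effect as grep -v protein_coding
        fields = line.rstrip("\n").split(" ")
        if len(fields) >= 6:
            counts[fields[5]] += 1
    return counts


def count_from_gzipped_fasta(path):
    """Apply the counter to a gzipped FASTA file on disk."""
    with gzip.open(path, "rt") as handle:
        return count_noncoding_biotypes(handle)
```

Calling `count_from_gzipped_fasta("Homo_sapiens.GRCh38.cdna.all.fa.gz")` and printing `sorted(counts.items())` should give a listing comparable to the one above.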
murphycj commented 7 years ago

Not sure why they're separate files, but I asked the Ensembl dev mailing list, so we'll see what they say (if they respond).

tavinathanson commented 7 years ago

Thanks for asking @murphycj!

murphycj commented 7 years ago

@iskandr @tavinathanson, someone posted an answer to the question.

iskandr commented 7 years ago

Latest PR closes this?

murphycj commented 7 years ago

Yes