openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0
380 stars, 65 forks

Support all protein coding biotypes #169

Open joaoe opened 8 years ago

joaoe commented 8 years ago

Currently, after the biotype cleanup, only the biotype "protein_coding" is used in the check in Transcript.is_protein_coding().

The list at http://www.ensembl.org/Help/Glossary?id=275 confuses me a bit, since it also includes nontranslating_CDS and polymorphic_pseudogene.

Perhaps the list in Transcript.is_protein_coding() should be extended to include IG_gene, TR_gene, non_stop_decay and nonsense_mediated_decay in addition to protein_coding?
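To make the proposal concrete, here is a rough sketch of what an extended check might look like. It is a standalone function rather than a patch to pyensembl: the names PROPOSED_CODING_BIOTYPES and is_proposed_protein_coding are made up for illustration, and matching IG_*_gene / TR_*_gene by prefix and suffix is my assumption about how the Ensembl IG and TR biotypes are spelled.

```python
# Hypothetical helper, not part of pyensembl. It only illustrates the
# extended biotype list proposed in this issue.
PROPOSED_CODING_BIOTYPES = frozenset({
    "protein_coding",
    "nonsense_mediated_decay",
    "non_stop_decay",
})


def is_proposed_protein_coding(biotype, include_ig_tr=True):
    """Return True if `biotype` would count as protein coding under the proposal.

    Assumption: IG/TR gene biotypes start with "IG_" or "TR_" and end with
    "_gene" (so pseudogene variants such as "IG_V_pseudogene" are excluded).
    """
    if biotype in PROPOSED_CODING_BIOTYPES:
        return True
    if include_ig_tr:
        return biotype.startswith(("IG_", "TR_")) and biotype.endswith("_gene")
    return False
```

The include_ig_tr flag is there only so that the IG/TR part of the proposal can be switched off independently of the decay biotypes.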

iskandr commented 8 years ago

The trouble with BCR and TCR genes is that they don't actually code for anything before recombination by Rag. The types which seem possibly more interesting for effect prediction are (1) non-stop decay & NMD genes, (2) polymorphic pseudogenes. However, they're both trickier to handle since you're probably predicting an effect which won't manifest in much or any actual protein product.

joaoe commented 8 years ago

"which won't manifest in much or any actual protein product."

The same could be said for regular coding transcripts that are not expressed :) Perhaps if something is added by https://github.com/hammerlab/varcode/issues/195 then this issue can be ignored.