jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

Include transcript/protein version in the database #89

Closed by ccwang002 5 years ago

ccwang002 commented 5 years ago

I was using the EnsDb database for Ensembl release 90 from AnnotationHub (AH57757), and I was wondering whether EnsDb could include the transcript version in the database as well.

For example, there are 4 transcripts associated with the human gene GATA3:

> edb <- EnsDb('EnsDb.Hsapiens.v90.sqlite')
> transcripts(edb, filter = ~ gene_name == "GATA3")[, c('tx_id', 'gene_name')]
GRanges object with 4 ranges and 2 metadata columns:
                  seqnames          ranges strand |           tx_id   gene_name
                     <Rle>       <IRanges>  <Rle> |     <character> <character>
  ENST00000481743       10 8053604-8055553      + | ENST00000481743       GATA3
  ENST00000379328       10 8054693-8075198      + | ENST00000379328       GATA3
  ENST00000346208       10 8054806-8074890      + | ENST00000346208       GATA3
  ENST00000461472       10 8058399-8074064      + | ENST00000461472       GATA3
  -------
  seqinfo: 1 sequence from GRCh38 genome

Instead of just the transcript ID, like ENST00000481743 and ENST00000379328, it would be nice to have an option to display the transcript version as well, like ENST00000481743.2 and ENST00000379328.8. Having the versioned ID is helpful when a project involves multiple Ensembl releases, because it makes it easy to tell whether a transcript's annotation has changed. Otherwise, the user has to go back to the transcript GTF to retrieve that information.
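To illustrate the use case, here is a minimal base R sketch that compares versioned transcript IDs from two releases and reports the transcripts whose version differs. The helper `split_versioned_id` and the vector names are hypothetical, and the `release_b` values are invented for illustration (only the two IDs in `release_a` appear above). It assumes every ID carries a `.N` version suffix.

```r
# Versioned transcript IDs for the same gene in two annotation releases.
# The two IDs in `release_a` are the ones quoted above; `release_b` is
# made-up example data, not taken from any real Ensembl release.
release_a <- c("ENST00000481743.2", "ENST00000379328.8")
release_b <- c("ENST00000481743.2", "ENST00000379328.9")

# Hypothetical helper: split "ENST00000379328.8" into the stable ID and
# the integer version. Assumes every ID ends in ".<digits>".
split_versioned_id <- function(ids) {
  data.frame(
    tx_id = sub("\\.[0-9]+$", "", ids),
    version = as.integer(sub("^.*\\.", "", ids)),
    stringsAsFactors = FALSE
  )
}

a <- split_versioned_id(release_a)
b <- split_versioned_id(release_b)

# Join on the stable ID and keep the transcripts whose version differs.
both <- merge(a, b, by = "tx_id", suffixes = c("_a", "_b"))
changed <- both[both$version_a != both$version_b, ]
print(changed)
```

Transcripts present in only one release are dropped by the inner join; passing `all = TRUE` to `merge` would keep them with `NA` versions, which then need separate handling.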

Thanks again for making this tool.

jorainer commented 5 years ago

Thanks for your feedback, @ccwang002. I am not storing the version information for transcripts (or genes, exons, etc.) in the EnsDb databases because it should be fixed for a given Ensembl release. I thought that having separate EnsDb databases for different Ensembl versions would suffice (hence skipping the transcript versions).

If you really require that information, I could add an additional column to the database. However, I would then also have to update all EnsDb databases in AnnotationHub (just to explain why I am hesitant).

jorainer commented 5 years ago

If we add this, we would have to be consistent and also add the gene_id_version. So:

In the Perl API we would have to use the ->stable_id_version() method to extract the respective ID with version appended.

jorainer commented 5 years ago

OK, so I will implement this.

jorainer commented 5 years ago

Done. I first have to create some EnsDbs to check that it works. Then I can go ahead and re-create all EnsDb databases on AnnotationHub; most likely I will do it for Ensembl version 94 first.

jorainer commented 5 years ago

Updating the EnsDbs on AnnotationHub:

jorainer commented 5 years ago

@ccwang002, for the (checked) versions above I have already uploaded updated EnsDb databases to AnnotationHub. You should be able to use them right away. With these databases, the genes, transcripts, ... calls return the additional columns tx_id_version and gene_id_version by default. You don't need to update ensembldb for that.
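As a rough sketch of how the new columns could be requested explicitly: the code below reuses the file name from the first post and assumes it now points at one of the re-uploaded databases. The helper `has_version_columns` is hypothetical. The ensembldb part only runs if the package is installed and the file exists.

```r
# Columns that the updated databases are said to provide.
version_columns <- c("tx_id_version", "gene_id_version")

# Hypothetical helper: TRUE if every versioned-ID column is available.
has_version_columns <- function(available) {
  all(version_columns %in% available)
}

# File name from the first post; assumed to be an updated database.
db_file <- "EnsDb.Hsapiens.v90.sqlite"

if (requireNamespace("ensembldb", quietly = TRUE) && file.exists(db_file)) {
  library(ensembldb)
  edb <- EnsDb(db_file)
  if (has_version_columns(listColumns(edb))) {
    # Ask for the versioned IDs alongside the plain transcript ID.
    tx <- transcripts(
      edb,
      filter = ~ gene_name == "GATA3",
      columns = c("tx_id", version_columns),
      return.type = "data.frame"
    )
    print(tx)
  } else {
    message("This EnsDb has no versioned ID columns; it may predate the update.")
  }
}
```

Checking `listColumns(edb)` first means an older database file produces a message instead of an error about unknown columns.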

ccwang002 commented 5 years ago

@jotsetung Thank you very much for your help! I was able to get the id versions from the new EnsDbs.

By the way, great work maintaining and developing ensembldb. It is easy to use and powerful.

jorainer commented 5 years ago

Just an update: I've updated the EnsDbs for Ensembl versions 90 to 94 hosted on AnnotationHub. All of these now also contain the versioned gene and transcript IDs.