Open jorainer opened 1 year ago
Thanks! Also thanks for making the package and huge thanks for making this great resource available in such a reasonable format!
> As a general note: the EnsDbs would also provide protein annotations (amino acid sequences, (functional) protein domains and some mapping to Uniprot identifiers).
Yup, we've actually already got some of this working right now (using some stuff in #31):

    import genomic_features as gf

    ensdb = gf.ensembl.annotation("Hsapiens", "108")
    ensdb.genes(
        cols=["gene_id", "gene_name", "tx_id", "uniprot_id"],
        filter=(gf.filters.CanonicalFilter() & gf.filters.GeneBioTypeFilter("protein_coding"))
    ).head()

               gene_id gene_name            tx_id     uniprot_id    gene_biotype  tx_is_canonical
    0  ENSG00000000003    TSPAN6  ENST00000373020     O43657.176  protein_coding                1
    1  ENSG00000000005      TNMD  ENST00000373031     Q9H2S6.148  protein_coding                1
    2  ENSG00000000419      DPM1  ENST00000371588  A0A0S2Z4Y5.30  protein_coding                1
    3  ENSG00000000419      DPM1  ENST00000371588     O60762.199  protein_coding                1
    4  ENSG00000000457     SCYL3  ENST00000367771     Q8IZE3.171  protein_coding                1

I have to confess that I'm not terribly familiar with common uses or access patterns for this protein information, so any tips would be appreciated!
> Also, please don't hesitate to ask if something about the EnsDb database layouts is unclear or if you have feature requests.

Honestly, so far it has been super self-explanatory and easy to figure out.
I would be interested in knowing if you had any plans for schema updates, or anything like that we should be aware of.
There was one thing that came up during our testing that I'd like to request. In #16, we saw that some Ensembl versions are missing from AnnotationHub and are instead bundled with the Bioconductor annotation packages.
Would it be possible (and easy) to upload these versions to AnnotationHub?
> I have to confess that I'm not terribly familiar with common uses or access patterns for this protein information, so any tips would be appreciated!
As long as Ensembl IDs are used (ENSG..., ENST..., ENSP...), all is pretty simple. There is (AFAIK) one Ensembl protein ID (ENSP...) assigned to one transcript ID (ENST...), so it is a straightforward 1:1 mapping. With Uniprot it is trickier, because there is no 1:1 mapping between Ensembl proteins and Uniprot: one Ensembl protein can be annotated to none, one or multiple Uniprot IDs. So, if possible, try to do the joins based on Ensembl IDs.
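To illustrate the difference between the two mappings, here is a minimal pandas sketch. The tables, column layout and all IDs below are made-up toy data (not real EnsDb content, and not the `genomic_features` API); the only assumption carried over from the thread is that transcript to protein is 1:1 while protein to Uniprot can be none, one or many.

```python
import pandas as pd

# Toy stand-ins for EnsDb-like tables; every ID here is hypothetical.
transcripts = pd.DataFrame({
    "tx_id": ["ENST_A", "ENST_B", "ENST_C"],
    "gene_name": ["GENE1", "GENE2", "GENE3"],
})
tx2protein = pd.DataFrame({
    "tx_id": ["ENST_A", "ENST_B", "ENST_C"],
    "protein_id": ["ENSP_A", "ENSP_B", "ENSP_C"],
})
protein2uniprot = pd.DataFrame({
    "protein_id": ["ENSP_A", "ENSP_A", "ENSP_B"],  # ENSP_C has no Uniprot entry
    "uniprot_id": ["U1", "U2", "U3"],
})

# Join on Ensembl IDs; validate= makes pandas raise if the 1:1 assumption breaks.
tx_protein = transcripts.merge(tx2protein, on="tx_id", validate="one_to_one")

# Adding Uniprot IDs can duplicate rows (multiple IDs) or leave NaN (no ID).
annotated = tx_protein.merge(
    protein2uniprot, on="protein_id", how="left", validate="one_to_many"
)

# Number of Uniprot IDs per Ensembl protein (count() ignores NaN).
n_uniprot = annotated.groupby("protein_id")["uniprot_id"].count()
print(annotated)
print(n_uniprot)
```

Keeping the Ensembl-based join separate from the Uniprot join means the row count only changes at the last step, which makes any duplication easy to spot.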
> I would be interested in knowing if you had any plans for schema updates, or anything like that we should be aware of.
No schema changes are planned. In the past I also tried to keep the main schema the same and just add e.g. new columns to individual tables.
Regarding the missing Ensembl versions: it would be possible to create EnsDb databases for these, but I would probably only do that for some specific releases, to avoid adding too many databases to AnnotationHub that might never be used...
Description of feature
Dear developers! Great work you're doing!
Just to introduce myself: I'm the developer of the ensembldb package and am maintaining and adding new EnsDb databases to Bioconductor's AnnotationHub for each new Ensembl release.
As a general note: the EnsDbs would also provide protein annotations (amino acid sequences, (functional) protein domains and some mapping to Uniprot identifiers). These things are not general GenomicFeatures features, but more specific to the ensembldb package, which also allows mapping directly between positions within the protein, the transcript and the genome (and vice versa).
Also, please don't hesitate to ask if something about the EnsDb database layouts is unclear or if you have feature requests.