Closed (smped closed this issue 4 years ago)
Thanks for the suggestion @steveped. I'll have a look at whether I can extract that from the Ensembl core databases. That would be the simple and straightforward solution.
Actually, what will be more interesting, the GC count or the GC content (i.e. GC count / transcript length)?
I compared the GC content that I got during database creation (using Perl and the Ensembl API) with the one I would get using Biostrings and the transcript sequences in R, and they are identical.
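For readers who want to repeat this kind of check in R, here is a minimal sketch of a GC content calculation with Biostrings. The sequences and names (`tx_a`, `tx_b`, `tx_c`) are made up for illustration and are not Ensembl transcripts; this is not necessarily the exact code used for the comparison above.

```r
library(Biostrings)

# Toy transcript sequences (made up for illustration; not real Ensembl data).
tx_seqs <- DNAStringSet(c(tx_a = "ATGCGC", tx_b = "AATTAATT", tx_c = "GGCC"))

# letterFrequency() with letters = "GC" counts G and C together per sequence
# and returns a one-column matrix; take that column as a plain vector.
gc_count <- letterFrequency(tx_seqs, letters = "GC")[, 1]

# GC content as a percentage of the transcript length.
gc_percent <- 100 * gc_count / width(tx_seqs)
names(gc_percent) <- names(tx_seqs)
gc_percent
```

With real data, `tx_seqs` would hold the transcript (cDNA) sequences for the release being checked, and `gc_percent` could then be compared against the values stored in the database.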
What remains to do is to document the new feature.
Hi @jorainer.
Thanks for chasing this so quickly! That sounds great. We tried building a few objects for our own use last week & over the weekend, but noticed a few transcripts were missing from some of the fa.gz files we pulled directly from Ensembl. They're up here (https://uofabioinformaticshub.github.io/Ensembl_GC/) but I think having these values put straight into an EnsDb object is a far better solution.
To answer your earlier question, I don't think it matters if it's a percentage GC or actual count at the transcript-level, as we can easily get each from the other just using transcript length.
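As a small illustration of that point, the conversion in both directions is just arithmetic on the transcript length. The numbers below are hypothetical and chosen only to make the round trip easy to follow.

```r
# Hypothetical values for three transcripts (not taken from Ensembl).
tx_length <- c(1200, 850, 2300)
gc_count <- c(600, 340, 1150)

# Count -> percentage.
gc_percent <- 100 * gc_count / tx_length

# Percentage -> count; round() guards against floating point noise.
gc_count_back <- round(gc_percent * tx_length / 100)

stopifnot(all(gc_count_back == gc_count))
```

If a stored percentage has been rounded to a few decimal places, the recovered count for a long transcript may be off by a small amount, so the round trip is exact only when enough precision is kept.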
I'm now providing the GC content as a percentage. These values are calculated directly on the sequences obtained from the Ensembl core database (see code https://github.com/jorainer/ensembldb/commit/d0af53a70b58feee20e7fc2662a80bb59610ae7d#diff-1f085737cfe97764c3ca0e8138993b3bR278-R280) and stored in a column "gc_content" in the transcript table.
I'm currently re-building all EnsDb objects for release 98 and will then push these new ones to AnnotationHub. It will take a few more days to complete though (generation is now also slower because of the sequence retrieval :) ). I'll let you know when everything is done.
They are now added to AnnotationHub - see also https://support.bioconductor.org/p/126708/
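A minimal sketch of how the new column might be read from one of these objects follows. The helper name `tx_gc_content` is hypothetical, and the code assumes the EnsDb was built with the "gc_content" column in its transcript table as described above; older EnsDb objects will not have it.

```r
library(ensembldb)

# Hypothetical helper: transcript IDs with their GC content (percent).
# Assumes `edb` is an EnsDb whose transcript table has a "gc_content" column.
tx_gc_content <- function(edb) {
    stopifnot(is(edb, "EnsDb"))
    if (!"gc_content" %in% listColumns(edb, "tx"))
        stop("this EnsDb has no 'gc_content' column in its transcript table")
    # Request the column explicitly so that it appears in mcols().
    txs <- transcripts(edb, columns = c("tx_id", "gc_content"))
    as.data.frame(mcols(txs)[, c("tx_id", "gc_content")])
}

# Possible usage (needs network access; the query terms are an assumption):
# library(AnnotationHub)
# ah <- AnnotationHub()
# hits <- query(ah, c("EnsDb", "Homo sapiens", "98"))
# edb <- hits[[1]]   # downloads the first matching record
# head(tx_gc_content(edb))
```

Requesting the column through `columns =` keeps the default output of `transcripts()` unchanged while still making the values available on demand.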
Closing the issue - feel free to reopen if needed.
Hi, Thanks for maintaining all of this. Given that GC content is commonly recalculated locally by numerous researchers, and that it is a fixed value for a given Ensembl release, is it worth including it as one of the standard mcols() when calling transcripts() or genes() on an EnsDb object? I can imagine it being (relatively) trivial to implement whilst building packages, and it would be very useful. Thanks