Inclusion of GC content?

jorainer / ensembldb

This is the ensembldb development repository.

https://jorainer.github.io/ensembldb


Inclusion of GC content? #103

Closed smped closed 4 years ago

smped commented 4 years ago

Hi, thanks for maintaining all of this. Given that GC content is commonly recalculated locally by numerous researchers, and that it is a fixed value for a given Ensembl release, would it be worth including it as one of the standard mcols() when calling transcripts() or genes() on an EnsDb object? I imagine it would be (relatively) trivial to implement while building the packages, and it would be very useful. Thanks

jorainer commented 4 years ago

Thanks for the suggestion @steveped . I'll check whether I can extract that from the Ensembl core databases. That would be the simple and straightforward solution.

jorainer commented 4 years ago

Actually, which would be more useful: the GC count or the GC content (i.e. GC count / transcript length)?
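To make the two quantities concrete, here is a minimal sketch in plain Python (not the R/Biostrings or Perl code used for ensembldb). The function names `gc_count` and `gc_fraction` and the example sequence are made up for illustration.

```python
def gc_count(seq: str) -> int:
    """Number of G and C bases in a nucleotide sequence (case-insensitive)."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def gc_fraction(seq: str) -> float:
    """GC content as a fraction: GC count divided by sequence length."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return gc_count(seq) / len(seq)


# Hypothetical 8-base sequence containing 4 G/C bases.
example = "ATGCGCTA"
print(gc_count(example))     # 4
print(gc_fraction(example))  # 0.5
```

The count depends on transcript length, while the fraction is normalised by it, which is the distinction the question is about.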

jorainer commented 4 years ago

I compared the GC content obtained during database creation (using Perl and the Ensembl API) with the values computed in R using Biostrings on the transcript sequences, and they are identical.

What remains is to document the new feature.

smped commented 4 years ago

Hi @jorainer.

Thanks for chasing this so quickly! That sounds great. We tried building a few objects for our own use last week & over the weekend, but noticed a few transcripts were missing from some of the fa.gz files we pulled directly from Ensembl. They're up here (https://uofabioinformaticshub.github.io/Ensembl_GC/) but I think having these values put straight into an EnsDb object is a far better solution.

To answer your earlier question, I don't think it matters if it's a percentage GC or actual count at the transcript-level, as we can easily get each from the other just using transcript length.
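As a rough illustration of that conversion, a small sketch in plain Python follows. The function names are hypothetical, and the rounding step assumes the percentage may have been stored with limited precision, so the recovered count is only as exact as the stored value allows.

```python
def gc_count_to_percent(gc_count: int, tx_length: int) -> float:
    """Convert a GC base count to a percentage of the transcript length."""
    if tx_length <= 0:
        raise ValueError("transcript length must be positive")
    return 100 * gc_count / tx_length


def gc_percent_to_count(gc_percent: float, tx_length: int) -> int:
    """Recover the GC base count from a percentage and the transcript length.

    Rounds to the nearest integer, since a stored percentage may carry
    floating-point or rounding error.
    """
    if tx_length <= 0:
        raise ValueError("transcript length must be positive")
    return round(gc_percent / 100 * tx_length)


# Hypothetical transcript: 450 G/C bases out of 1000.
print(gc_count_to_percent(450, 1000))   # 45.0
print(gc_percent_to_count(45.0, 1000))  # 450
```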

jorainer commented 4 years ago

I'm now providing the GC content as a percentage. These values are calculated directly on the sequences obtained from the Ensembl core database (see code https://github.com/jorainer/ensembldb/commit/d0af53a70b58feee20e7fc2662a80bb59610ae7d#diff-1f085737cfe97764c3ca0e8138993b3bR278-R280) and stored in a column "gc_content" in the transcript table.

I'm currently re-building all EnsDb objects for release 98 and will then push the new ones to AnnotationHub. This will take a few more days to complete though (generation is now also slower because of the sequence retrieval :) ). I'll let you know when everything is done.

jorainer commented 4 years ago

They are now added to AnnotationHub - see also https://support.bioconductor.org/p/126708/

Closing the issue - feel free to reopen if needed.