jokergoo / rGREAT

GREAT Analysis - Functional Enrichment on Genomic Regions
https://jokergoo.github.io/rGREAT
Other
84 stars 12 forks source link

[Request] Support for genomes beyond GenomeInfoDb's seqlevelsStyle #41

Open asgeissler opened 1 year ago

asgeissler commented 1 year ago

Thank you for this helpful package. I noticed that the package has a fairly strong integration with GenomeInfoDb. (From the perspective of Bioconductor data structure integration that's great.) But, the limited number of supported seqlevelsStyle/genomeStyles in GenomeInfoDb for "only" human, mouse, aso, makes it difficult to use the rGREAT package for "exotic" genomes (such as bacteria).

Even when manually preparing the arguments gr, gene_sets, extended_tss, background as genomic ranges, the "local" great function is currently hard coded to query for the seqlevelsStyle of the input, eg in the line

seqlevelsStyle(extended_tss) = seqlevelsStyle(gr)[1]

Which of course fails for my bacterial genome input

Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style

Info: The seqlevels are needed to query for the genome length to build the larger background.

...
sl = seqlengths(extended_tss)
if(any(is.na(sl))) {
stop_wrap("`extended_tss` must have `seqlengths` been set.")
}
gr_genome = GRanges(seqnames = names(sl), ranges = IRanges(1, sl))
...

My current solution is to copy/paste/adapt the function code of great to avoid the seqlevels and manually provide the genome lengths. However, my code is a bit ad hoc and breaks of course the seqlevelsStyle support.

Thus, my request: Would it be possible to refactor the function great to allow for custom genomes? For instance, the math logic of the function (starting at the foreach) could be a separate function. Since the various if statements in the first half of the function prepare the data structure for the main computation, a separate function might allow for custom wrappers.

asgeissler commented 1 year ago

For completeness, here is my current workaround. My approach was to remove most of the if statements in the first part of the function to fit my purpose. Then I added a new argument that provides custom genome length values for the sl variable, instead of the sl = seqlengths(extended_tss) call.

https://gist.github.com/asgeissler/124e89d9fb9e2ae9ae30597e0048dc62

jokergoo commented 6 months ago

Hi, sorry for the very late reply. Another user inquiry leads me here :)

The design of the great() function is to let it be general to any organism with any format of the genome. Maybe I missed something. I will have a check and reply to you soon.

For the behaviour of seqlevelsStyle(), my original understanding is that if the genome is not supported, it does nothing. Maybe the default behaviour has been changed in newer bioc versions to throw an error? I will also check that.

jokergoo commented 6 months ago

I have fixed the bug of the error from seqlevelsStyle().