Closed vladimirsouza closed 1 year ago
Thanks for posting this Vladimir. I like your suggestion that the returned regions follow the same order as the genes given to the function. However, have you played around with the return_as
parameter? If set to "df" or "bed", the returned data frame will tell you what genes correspond to what regions.
> my_genes <- c("MYC", "BCL2", "imaginary_gene")
> gene_to_region(gene_symbol = my_genes, return_as = "df")
2 region(s) returned for 3 gene(s)
chromosome start end gene_name hugo_symbol ensembl_gene_id
1 18 60790579 60987361 BCL2 BCL2 ENSG00000171791
2 8 128747680 128753674 MYC MYC ENSG00000136997
The standard return format is what we refer to as a "region" format (chr:start-end). This functionality should remain since it returns the region for any given gene in a format that is accepted by other GAMBLR functions.
Some room for improvement from my perspective;
You could also have a look at the relative to this function regions_to_gene
and see how/if the same suggested improvements would apply to this function.
Seems to me that Vladimier's suggestion of ordering the regions vector using gene_name is potentially useful and the named vector may have other benefits. Returning NA for ones that don't exist should probably not be the default behaviour though.
gene_to_region
always sorts the order of the output regions by chromosome (but incorrectly;chromosome
ischaracter
) and start location. In this way, it's impossible to know to which gene each output region refers. An example:In this example, The first region
"18:60790579-60987361"
refers to"BCL2"
(second input gene), and the second region"8:128747680-128753674"
refers to"MYC"
(first gene)."imaginary_gene"
was ignored.My suggestion is that the order of the output regions follows the same order as the input genes (they should have the same lengths). Besides that, the function could output a named vector to inform the gene names as well. An example:
If useful, we can add a new boolean parameter
sort_regions
, whereTRUE
(must be specified by the user; default is FALSE) means to sort regions by their chromosome and start location (NA
s always to the right hand).If anyone finds this change useful, I would be happy in making the change and submit a Pull Request.