morinlab / GAMBLR

Set of standardized functions to operate with genomic data
MIT License
4 stars 2 forks source link

`gene_to_region` function changes input order of genes #213

Closed vladimirsouza closed 1 year ago

vladimirsouza commented 1 year ago

gene_to_region always sorts the order of the output regions by chromosome (but incorrectly; chromosome is character) and start location. In this way, it's impossible to know to which gene each output region refers. An example:

> my_genes <- c("MYC", "BCL2", "imaginary_gene")
> gene_to_region(gene_symbol = my_genes)
# 2 region(s) returned for 2 gene(s)
# [1] "18:60790579-60987361"  "8:128747680-128753674"

In this example, The first region "18:60790579-60987361" refers to "BCL2" (second input gene), and the second region "8:128747680-128753674" refers to "MYC" (first gene). "imaginary_gene" was ignored.

My suggestion is that the order of the output regions follows the same order as the input genes (they should have the same lengths). Besides that, the function could output a named vector to inform the gene names as well. An example:

> my_genes <- c("MYC", "BCL2", "imaginary_gene")
> gene_to_region(gene_symbol = my_genes)
#                     MYC                    BCL2          imaginary_gene 
# "8:128747680-128753674"  "18:60790579-60987361"                      NA 

If useful, we can add a new boolean parameter sort_regions, where TRUE (must be specified by the user; default is FALSE) means to sort regions by their chromosome and start location (NAs always to the right hand).

> my_genes <- c("MYC", "imaginary_gene", "BCL2")
> gene_to_region(gene_symbol = my_genes, sort_regions = TRUE)
#                     MYC                    BCL2          imaginary_gene 
# "8:128747680-128753674"  "18:60790579-60987361"                      NA 

If anyone finds this change useful, I would be happy in making the change and submit a Pull Request.

mattssca commented 1 year ago

Thanks for posting this Vladimir. I like your suggestion that the returned regions follow the same order as the genes given to the function. However, have you played around with the return_as parameter? If set to "df" or "bed", the returned data frame will tell you what genes correspond to what regions.

> my_genes <- c("MYC", "BCL2", "imaginary_gene")
> gene_to_region(gene_symbol = my_genes, return_as = "df")
2 region(s) returned for 3 gene(s)
  chromosome     start       end gene_name hugo_symbol ensembl_gene_id
1         18  60790579  60987361      BCL2        BCL2 ENSG00000171791
2          8 128747680 128753674       MYC         MYC ENSG00000136997

The standard return format is what we refer to as a "region" format (chr:start-end). This functionality should remain since it returns the region for any given gene in a format that is accepted by other GAMBLR functions.

mattssca commented 1 year ago

Some room for improvement from my perspective;

  1. Correctly sort the return (chr, start), which could be default without the need to add another parameter.
  2. Maybe add a message that gets printed to the console for any regions/genes not found?

You could also have a look at the relative to this function regions_to_gene and see how/if the same suggested improvements would apply to this function.

rdmorin commented 1 year ago

Seems to me that Vladimier's suggestion of ordering the regions vector using gene_name is potentially useful and the named vector may have other benefits. Returning NA for ones that don't exist should probably not be the default behaviour though.