morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

Change `gene_to_region` and add `get_cnv_and_ssm_status` #218

Closed vladimirsouza closed 12 months ago

vladimirsouza commented 1 year ago

This pull request addresses issues #213 and #208.

About issue #213 (`gene_to_region` changes the input order of genes):

Please also see my latest comment (June 26) on each file.

About issue #208 (new wrapper function to produce the SSM and CN joint matrix):

`get_cnv_and_ssm_status` returns `NA` for MYC in the following call:

get_cnv_and_ssm_status( genes_and_cn_threshs = genes_and_cn_threshs, these_samples_metadata = these_samples_metadata, seq_type = seq_type, only_cnv = "all" )

2 region(s) returned for 2 gene(s)
                            MYC MIR17HG
BLGSP-71-19-00123-09A.1-01D  NA       0

For this sample, `get_cn_segments` returns no CN segments for MYC:

get_cn_segments(region = "8:128747680-128753674", streamlined = TRUE, this_seq_type = seq_type) %>% filter(ID %in% these_samples_metadata$sample_id)

[1] ID CN
<0 rows> (or 0-length row.names)

However, `get_cn_segments` does return a CN segment for MIR17HG:

get_cn_segments(region = "13:92000074-92006833", streamlined = TRUE, this_seq_type = seq_type) %>% filter(ID %in% these_samples_metadata$sample_id)

                           ID CN
1 BLGSP-71-19-00123-09A.1-01D  2
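
The `NA` for MYC is what one would expect from a lookup that finds no segment for the sample. The sketch below illustrates this with hypothetical data; `segments` and `cn_lookup` are illustrative names, not GAMBLR's implementation, and no CN thresholds are applied.

```r
# Hypothetical streamlined segment tables, one per gene region.
# They mimic the shape (ID, CN) of the get_cn_segments() output shown above;
# this is an assumption for illustration, not GAMBLR's internal data structure.
segments <- list(
  MYC     = data.frame(ID = character(0), CN = numeric(0),
                       stringsAsFactors = FALSE),
  MIR17HG = data.frame(ID = "BLGSP-71-19-00123-09A.1-01D", CN = 2,
                       stringsAsFactors = FALSE)
)

sample_ids <- "BLGSP-71-19-00123-09A.1-01D"

# Illustrative helper: look up each sample's CN for one gene.
# match() returns NA for samples with no segment, so the CN is NA there.
cn_lookup <- function(seg, ids) seg$CN[match(ids, seg$ID)]

# One column per gene, one row per sample.
cn_matrix <- do.call(cbind, lapply(segments, cn_lookup, ids = sample_ids))
rownames(cn_matrix) <- sample_ids

print(cn_matrix)
```

Whether a missing segment should stay `NA` or be treated as a neutral copy number is a design decision for the wrapper; the sketch only shows where the `NA` can come from.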


# Pull Request Checklists

## Checklist for all PRs

### Required

- [x] I tested the new code for my use case (please provide a reproducible example of how you tested the new functionality)

I tried many different combinations of arguments. Here are two of them.

gene_to_region(
    gene_symbol = c("BCL2", "imaginary_gene", "MYC"),
    ensembl_id = c("ENSG00000171791", "imaginary_gene", "ENSG00000136997"),
    genome_build = c("grch37", "grch38")[1],
    return_as = c("region", "bed", "df")[2],
    sort_regions = c(TRUE, FALSE)[1],
    na_for_genes_not_found = c(TRUE, FALSE)[2]
)

2 region(s) returned for 3 gene(s)
  chromosome     start       end hugo_symbol
1          8 128747680 128753674         MYC
2         18  60790579  60987361        BCL2

gene_to_region(
    gene_symbol = c("BCL2", "imaginary_gene", "MYC"),
    ensembl_id = c("ENSG00000171791", "imaginary_gene", "ENSG00000136997"),
    genome_build = c("grch37", "grch38")[2],
    return_as = c("region", "bed", "df")[1],
    sort_regions = c(TRUE, FALSE)[2],
    na_for_genes_not_found = c(TRUE, FALSE)[1]
)

2 region(s) returned for 3 gene(s)
           ENSG00000171791             imaginary_gene            ENSG00000136997
 "chr18:63123345-63320128"                         NA "chr8:127735433-127742951"


- [x] I ensured all dplyr functions that commonly conflict with other packages are fully qualified. 

- [x] I generated the documentation and checked for errors relating to the new function (e.g. `devtools::document()`) and added `NAMESPACE` and all other modified files in the root directory and under `man`. 

## Checklist for New Functions

### Required

- [x] I documented my function using [Roxygen style](https://jozef.io/r102-addin-roxytags/#:~:text=Inserting%20a%20skeleton%20%2D%20Do%20this,Shift%2BAlt%2BR%20).)

- [x] Adequate function documentation (see [new-function documentation template](https://github.com/morinlab/GAMBLR#title) for more info)

- [x] I have run `devtools::document()` to add the newly created function to NAMESPACE (do not manually add anything to this file!).

## Checklist for changes to existing code

- [x] I added/removed arguments to a function and updated documentation for all changed/new arguments

- [x] I tested the new code for compatibility with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)

fancy_ideogram(this_sample_id = "HTMCP-01-06-00422-01A-01D", gene_annotation = c("BCL2", "imaginary_gene", "MYC"), plot_title = "Sample-level Ideogram Example", plot_subtitle = "grch37")


`fancy_ideogram` uses `gene_to_region` internally. The call above produces [this plot](https://drive.google.com/file/d/1gjrEXiW7zsVZbMHeziBxD9jzfY6gegnO/view?usp=sharing).

vladimirsouza commented 1 year ago

I've addressed the feedback received on this PR. Please review it again.

An example output (output matches input gene order):

> gene_to_region(
+     gene_symbol = c("KLHL21", "BCL11A", "BCL2", "MYC", "PTPRD", "WAS", "FAS", "ATM"),
+     genome_build = "grch37",
+     return_as = "region",
+     sort_regions = FALSE
+ )
8 region(s) returned for 8 gene(s)
                  KLHL21                   BCL11A                     BCL2                      MYC 
     "1:6650784-6674667"    "2:60678302-60780702"   "18:60790579-60987361"  "8:128747680-128753674" 
                   PTPRD                      WAS                      FAS                      ATM 
    "9:8314246-10612723"    "X:48534985-48549818"   "10:90750414-90775542" "11:108093211-108239829" 
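
One way to get output that follows the input gene order is to index the annotation with `match()`. The sketch below uses a small hypothetical annotation table (`annotation` and `query` are illustrative names; this is not necessarily how `gene_to_region` is implemented):

```r
# Hypothetical annotation table, deliberately in a different order than the query.
# Coordinates are the grch37 regions shown in the output above.
annotation <- data.frame(
  hugo_symbol = c("MYC", "BCL2", "KLHL21"),
  region = c("8:128747680-128753674",
             "18:60790579-60987361",
             "1:6650784-6674667"),
  stringsAsFactors = FALSE
)

query <- c("KLHL21", "imaginary_gene", "BCL2", "MYC")

# match() gives, for each queried gene, its row in the annotation (NA if absent),
# so indexing by it returns the regions in query order.
regions <- setNames(annotation$region[match(query, annotation$hugo_symbol)], query)

# Drop genes that were not found, keeping the order of the remaining ones.
regions_found <- regions[!is.na(regions)]

print(regions)
print(regions_found)
```

Keeping the `NA` entries (as in `regions`) preserves a one-to-one mapping with the input, while dropping them (as in `regions_found`) returns only the genes with known regions.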

One more example including genes with unavailable regions and different parameters:

> gene_to_region(
+     gene_symbol = c("KLHL21", "BCL11A", "BCL2", "imaginary_gene", "MYC", "PTPRD", "WAS", "FAS", "ATM", "another_imaginary_gene"),
+     genome_build = "hg38",
+     return_as = "df",
+     sort_regions = TRUE
+ )
Some input gene(s) have no region info available. They are:
imaginary_gene, another_imaginary_gene.
8 region(s) returned for 10 gene(s)
  chromosome     start       end gene_name hugo_symbol ensembl_gene_id
1       chr1   6590723   6614607    KLHL21      KLHL21 ENSG00000162413
2       chr2  60450519  60554467    BCL11A      BCL11A ENSG00000119866
3       chr8 127735433 127742951       MYC         MYC ENSG00000136997
4       chr9   8314245  10613002     PTPRD       PTPRD ENSG00000153707
5      chr10  88953812  89029605       FAS         FAS ENSG00000026103
6      chr11 108223043 108369102       ATM         ATM ENSG00000149311
7      chr18  63123345  63320128      BCL2        BCL2 ENSG00000171791
8       chrX  48676595  48691431       WAS         WAS ENSG00000015285
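
The sorted output above follows natural chromosome order (chr1, chr2, ..., chr18, chrX), not alphabetical order. A minimal sketch of such a sort in base R, assuming regions are ordered by chromosome and then by start position (the data frame and `chrom_levels` are illustrative, not GAMBLR internals):

```r
# Hypothetical unsorted regions, using hg38 coordinates from the output above.
regions <- data.frame(
  chromosome = c("chr18", "chr8", "chrX", "chr10", "chr2"),
  start = c(63123345, 127735433, 48676595, 88953812, 60450519),
  hugo_symbol = c("BCL2", "MYC", "WAS", "FAS", "BCL11A"),
  stringsAsFactors = FALSE
)

# Natural chromosome order: 1..22, then X and Y.
chrom_levels <- paste0("chr", c(1:22, "X", "Y"))

# Rank each chromosome by its position in chrom_levels, then break ties by start.
chrom_rank <- match(regions$chromosome, chrom_levels)
sorted <- regions[order(chrom_rank, regions$start), ]
rownames(sorted) <- NULL

print(sorted)
```

A plain alphabetical sort would place chr10 and chr18 before chr2, which is why an explicit chromosome ranking is used here.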