Closed vladimirsouza closed 12 months ago
I've addressed the feedback received in this PR. Please, review this PR.
An example output (output matches input gene order):
> gene_to_region(
+ gene_symbol = c("KLHL21", "BCL11A", "BCL2", "MYC", "PTPRD", "WAS", "FAS", "ATM"),
+ genome_build = "grch37",
+ return_as = "region",
+ sort_regions = FALSE
+ )
8 region(s) returned for 8 gene(s)
KLHL21 BCL11A BCL2 MYC
"1:6650784-6674667" "2:60678302-60780702" "18:60790579-60987361" "8:128747680-128753674"
PTPRD WAS FAS ATM
"9:8314246-10612723" "X:48534985-48549818" "10:90750414-90775542" "11:108093211-108239829"
One more example including genes with unavailable regions and different parameters:
> gene_to_region(
+ gene_symbol = c("KLHL21", "BCL11A", "BCL2", "imaginary_gene", "MYC", "PTPRD", "WAS", "FAS", "ATM", "another_imaginary_gene"),
+ genome_build = "hg38",
+ return_as = "df",
+ sort_regions = TRUE
+ )
Some input gene(s) have no region info available. They are:
imaginary_gene, another_imaginary_gene.
8 region(s) returned for 10 gene(s)
chromosome start end gene_name hugo_symbol ensembl_gene_id
1 chr1 6590723 6614607 KLHL21 KLHL21 ENSG00000162413
2 chr2 60450519 60554467 BCL11A BCL11A ENSG00000119866
3 chr8 127735433 127742951 MYC MYC ENSG00000136997
4 chr9 8314245 10613002 PTPRD PTPRD ENSG00000153707
5 chr10 88953812 89029605 FAS FAS ENSG00000026103
6 chr11 108223043 108369102 ATM ATM ENSG00000149311
7 chr18 63123345 63320128 BCL2 BCL2 ENSG00000171791
8 chrX 48676595 48691431 WAS WAS ENSG00000015285
This pull request is a response to the #213 and #208 issues.
About #213 issue (gene_to_region function changes input order of genes):
return_as = "region"
, return a named vector where names are gene ids.sort_regions
(default is TRUE). If FALSE, the order of the output regions follows the same order of the input genes.na_for_genes_not_found
. The function never outputs NAs for genes not found anymore. If there are genes not found, the function prints a list with these genes.genome_build
argument aregrch37
orhg38
.Please, also see my last (June 26) comment on each file.
About #208 issue (New wrapper function to produce the SSM and CN joint matrix):
get_cnv_and_ssm_status
uses updatedgene_to_region
.augmented
argument is different from the ones inget_ssm_by_samples
's andget_coding_ssm_status
's help (I find them confusing). Please, check if it keeps the same meaning. My description:Argument to internally pass to the functions `get_ssm_by_samples` and `get_coding_ssm_status`. A logical parameter (default: TRUE). Set to FALSE to use multi-sample patients, instead of the original MAF from each sample.
. Description in get_ssm_by_samples and get_coding_ssm_status:default: TRUE. Set to FALSE if you instead want the original MAF from each sample for multi-sample patients instead of the augmented MAF.
subset_from_merge
andaugmented
because if using their default values, themin_read_support_ssm
argument (it's calledmin_read_support
inget_ssm_by_samples
andget_coding_ssm_status
) is useless. When runningget_ssm_by_samples
with default values for these arguments, theget_ssm_by_sample
(without 's') function is called internally, which in turn doesn't usemin_read_support
because the/projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/slms_3-1.0_vcf2maf-1.3/99-outputs/deblacklisted/augmented_maf/genome--projection/BLGSP-71-06-00160-01A-03D--BLGSP-71-06-00160-12A-01D--matched.slms-3.final.grch37.maf
file doesn't exists (see code). P.S.: If we don't give both argumentsmaf_data
andmaf_path
toget_coding_ssm_status
, it callsget_coding_ssm
with fixed value formin_read_support
= 3 (code).get_cn_states
and the reason is thatget_cn_segments
doesn't return CN segments for some combinations of genes and samples. Here is an example:get_cnv_and_ssm_status
's output, and this is not an error of the function. NAs happen while runningget_cn_states
and the reason is thatget_cn_segments
doesn't return CN segments for some combinations of genes and samples. Here is an example:get_cnv_and_ssm_status returns NA
get_cnv_and_ssm_status( genes_and_cn_threshs = genes_and_cn_threshs, these_samples_metadata = these_samples_metadata, seq_type = seq_type, only_cnv = "all" )
2 region(s) returned for 2 gene(s)
MYC MIR17HG
BLGSP-71-19-00123-09A.1-01D NA 0
for this specified sample, get_cn_segments doesn't return cn segments from MYC
get_cn_segments(region = "8:128747680-128753674", streamlined = TRUE, this_seq_type = seq_type) %>% filter(ID %in% these_samples_metadata$sample_id)
[1] ID CN
<0 rows> (or 0-length row.names)
but get_cn_segments returns cn segments from MIR17HG
get_cn_segments(region = "13:92000074-92006833", streamlined = TRUE, this_seq_type = seq_type) %>% filter(ID %in% these_samples_metadata$sample_id)
ID CN
1 BLGSP-71-19-00123-09A.1-01D 2
I tried many different combinations of arguments. Here are two of them.
gene_to_region( gene_symbol = c("BCL2","imaginary_gene", "MYC"),
ensembl_id = c("ENSG00000171791","imaginary_gene", "ENSG00000136997"),
genome_build = c("grch37", "grch38") [1], return_as = c("region", "bed", "df") [2], sort_regions = c(TRUE, FALSE) [1], na_for_genes_not_found = c(TRUE, FALSE) [2] )
2 region(s) returned for 3 gene(s)
chromosome start end hugo_symbol
1 8 128747680 128753674 MYC
2 18 60790579 60987361 BCL2
gene_to_region(
gene_symbol = c("BCL2","imaginary_gene", "MYC"),
ensembl_id = c("ENSG00000171791","imaginary_gene", "ENSG00000136997"), genome_build = c("grch37", "grch38") [2], return_as = c("region", "bed", "df") [1], sort_regions = c(TRUE, FALSE) [2], na_for_genes_not_found = c(TRUE, FALSE) [1] )
2 region(s) returned for 3 gene(s)
ENSG00000171791 imaginary_gene ENSG00000136997
"chr18:63123345-63320128" NA "chr8:127735433-127742951"
fancy_ideogram(this_sample_id = "HTMCP-01-06-00422-01A-01D", gene_annotation = c("BCL2", "imaginary_gene", "MYC"), plot_title = "Sample-level Ideogram Example", plot_subtitle = "grch37")