morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

New wrapper function to produce the SSM and CN joint matrix #208

Closed: Kdreval closed this issue 8 months ago

Kdreval commented 1 year ago

GAMBLR already has a function that generates the feature matrix for simple mutations (get_coding_ssm_status()) and another that returns the copy number state of the gene of interest (get_cn_states()). What is missing is a wrapper function that runs both and generates a single binary matrix (0 for absence and 1 for presence of the feature), where either a mutation or a CNV counts as the feature.

Here is an example:

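library(GAMBLR)
library(dplyr) # provides %>%, filter(), mutate() and pull()

my_genes <- "MYC"

# Build a "chromosome:start-end" region string for each gene of interest
my_regions <- grch37_lymphoma_genes_bed %>%
    filter(hgnc_symbol %in% my_genes) %>%
    mutate(region = paste0(
        chromosome_name,
        ":",
        start_position,
        "-",
        end_position
    )) %>%
    pull(region)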

example_ids <- c(
    "BLGSP-71-06-00160-01A-03D",
    "BLGSP-71-06-00252-01A-01D",
    "BLGSP-71-19-00122-09A.1-01D",
    "BLGSP-71-19-00523-09A-01D",
    "BLGSP-71-21-00187-01A-01E",
    "BLGSP-71-21-00188-01A-04E"
)

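# Restrict the metadata to the example samples
my_meta <- get_gambl_metadata() %>%
    filter(sample_id %in% example_ids)

# Simple mutations for the example samples
my_maf <- get_ssm_by_samples(
    these_samples_metadata = my_meta
)

# Copy number state of each region, one column per name in region_names
cn_matrix <- get_cn_states(
    regions_list = my_regions,
    these_samples_metadata = my_meta,
    region_names = my_genes
)

# Binary coding mutation status (0/1) per gene, computed from my_maf
ssm_matrix <- get_coding_ssm_status(
    gene_symbols = my_genes,
    these_samples_metadata = my_meta,
    maf_data = my_maf,
    from_flatfile = FALSE,
    include_hotspots = FALSE
)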

In this example, 3 samples have no SSM (ssm_matrix has 0 for mutation presence) but do have a CNV (cn_matrix reports a copy number higher than 2). The new function will aggregate these events, so all samples in the example will have 1 for the combined feature MYC_Mut_or_AMP:

                    sample_id MYC_Mut_or_AMP
1   BLGSP-71-06-00160-01A-03D              1
2   BLGSP-71-06-00252-01A-01D              1
3 BLGSP-71-19-00122-09A.1-01D              1
4   BLGSP-71-19-00523-09A-01D              1
5   BLGSP-71-21-00187-01A-01E              1
6   BLGSP-71-21-00188-01A-04E              1

Since this function will handle CN data, it should take a parameter that sets the cutoff on absolute CN for counting an event as a feature (for example, so that we can disregard one-copy gains with CN of 3, or two-copy gains with CN of 4, etc.).
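To make the proposal concrete, here is a minimal sketch of the aggregation step with an adjustable CN cutoff. It is not the function that ended up in GAMBLR: the name combine_ssm_and_cn() and the parameter min_cn are hypothetical, and the sketch assumes both inputs have sample IDs as row names and genes as columns, which may differ from what get_coding_ssm_status() and get_cn_states() actually return. The data are toy values, not the BLGSP samples above.

```r
# Hypothetical helper, not part of GAMBLR: combine a binary SSM matrix and an
# absolute copy number matrix into one binary "mutated or amplified" matrix.
# Assumption: both inputs have sample IDs as row names and genes as columns.
combine_ssm_and_cn <- function(ssm_matrix, cn_matrix, min_cn = 3) {
    # Keep only samples and genes present in both inputs, in a shared order
    samples <- intersect(rownames(ssm_matrix), rownames(cn_matrix))
    genes <- intersect(colnames(ssm_matrix), colnames(cn_matrix))
    ssm <- as.matrix(ssm_matrix[samples, genes, drop = FALSE])
    cn <- as.matrix(cn_matrix[samples, genes, drop = FALSE])

    # A feature is present when the gene is mutated or its CN reaches min_cn
    combined <- (ssm > 0 | cn >= min_cn) * 1L
    colnames(combined) <- paste0(genes, "_Mut_or_AMP")
    combined
}

# Toy inputs (made-up samples and values, for illustration only)
toy_samples <- c("S1", "S2", "S3", "S4")
ssm_toy <- matrix(c(1, 0, 0, 0), ncol = 1,
    dimnames = list(toy_samples, "MYC"))
cn_toy <- matrix(c(2, 3, 5, 2), ncol = 1,
    dimnames = list(toy_samples, "MYC"))

# Default cutoff: any CN above 2 counts as a gain
default_cutoff <- combine_ssm_and_cn(ssm_toy, cn_toy)

# Stricter cutoff: one-copy gains (CN of 3) are disregarded
strict_cutoff <- combine_ssm_and_cn(ssm_toy, cn_toy, min_cn = 4)
```

With the default cutoff, S2 counts as a feature through its one-copy gain; with min_cn = 4 it drops back to 0, while S1 (mutated) and S3 (CN of 5) stay at 1.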

vladimirsouza commented 8 months ago

Issue solved in this PR.