`get_cn_states()` output is misleading for missing samples

Kdreval commented 1 year ago

The get_cn_states() has a misleading feature in the code that needs to be dropped and deprecated. When the sample id does not have the CNV data, the output of get_cn_states() returns the matrix with neutral CN states instead of accurately reporting the missing value as NA.

The minimal reproducible example that illustrates this problem:

my_regions <- grch37_lymphoma_genes_bed %>%
    head %>%
    mutate(region = paste0(
        chromosome_name,
        ":",
        start_position,
        "-",
        end_position
    )) %>%
    pull(region)

my_meta <- get_gambl_metadata() %>%
    filter(sample_id %in% c(
        "10-18191T", # This is chromium sample omitted from CNV workflows and missing the CNV results
        "HTMCP-01-06-00485-01A-01D", # Some random ids
        "012-19-01TD"
    )
)

# Now running the get_cn_states produces this matrix
get_cn_states(
    regions_list = my_regions,
    these_samples_metadata = my_meta
)

                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285
10-18191T                                 2                 2                 2                 2                   2                   2
012-19-01TD                               2                 1                 1                 1                   1                   2
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

# But the real reason the first sample has all diploids is because it does not have CNV outputs
# We can double-check this by running
get_sample_cn_segments(
    this_sample_id = my_meta$sample_id
) %>%
    pull(ID) %>%
    unique

[1] "HTMCP-01-06-00485-01A-01D" "012-19-01TD" 

# More wild example where one of the IDs is clobbered
clobbered_meta <- my_meta %>%
    mutate(sample_id = c("Imaginary_sample", my_meta$sample_id[2:3]))

get_cn_states(
    regions_list = my_regions,
    these_samples_metadata = clobbered_meta
)

                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285
10-18191T                                 2                 2                 2                 2                   2                   2
Imaginary_sample                          2                 2                 2                 2                   2                   2
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

This part has to be updated to keep the NA values instead of replacing them.

vladimirsouza commented 1 year ago

In this part of the code, the function intentionally changes NAs to counts like 2 (diploid):

  #fill in any sample/region combinations with missing data as diploid
  meta_arranged = these_samples_metadata %>%
    dplyr::select(sample_id, pathology, lymphgen) %>%
    arrange(pathology, lymphgen)

  eg = expand_grid(sample_id = pull(meta_arranged, sample_id), region_name = as.character(unique(seg_df$region_name)))
  all_cn = left_join(eg, seg_df, by = c("sample_id" = "sample_id", "region_name" = "region_name")) %>%
    mutate(CN = replace_na(CN, 2))

I wonder if this has any useful application somehow. If so, instead of just removing the line mutate(CN = replace_na(CN, 2)), would be better to create the new boolean parameter missing_data_as_diploid where TRUE means to replace NAs to diploid (default equal to FALSE)?

Then the new code would be:

  meta_arranged = these_samples_metadata %>%
    dplyr::select(sample_id, pathology, lymphgen) %>%
    arrange(pathology, lymphgen)

  eg = expand_grid(sample_id = pull(meta_arranged, sample_id), region_name = as.character(unique(seg_df$region_name)))
  all_cn = left_join(eg, seg_df, by = c("sample_id" = "sample_id", "region_name" = "region_name"))

  #fill in any sample/region combinations with missing data as diploid
  if(missing_data_as_diploid){
    all_cn = mutate(all_cn, CN = replace_na(CN, 2))
  }

Kdreval commented 1 year ago

I think this is a good suggestion to have this configurable so we respect the legacy behavior but also have the flexibility to return NAs. Thanks!

vladimirsouza commented 1 year ago

As get_cn_states is, it cannot differentiate samples that don't contain CN from samples that don't exist --- for both cases, the function generates NA values. Also, I think getting a zero for the first case would be more informative than a NA. Would it be interesting to make this change? For example, we could get outputs like this:

                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564
10-18191T                                 0                 0                 0                 0                   0
Imaginary_sample                         NA                NA                NA                NA                  NA
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2
                          1:12227060-12269285
10-18191T                                   0
Imaginary_sample                           NA
HTMCP-01-06-00485-01A-01D                   2

Kdreval commented 1 year ago

0 in this case would mean that there are 0 copies of DNA in that region (or gene), so would be interpreted as deletions. Returning 2 for samples that don't contain CN changes would just mean that they have a diploid state, and returning NA for those that don't exist would mean that this information is not available. So the 2 vs NA will be the distinction. So the table in this small example would be

                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564
10-18191T                                 NA                 NA                 NA                 NA                   NA
Imaginary_sample                         NA                NA                NA                NA                  NA
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2
                          1:12227060-12269285
10-18191T                                   NA
Imaginary_sample                           NA
HTMCP-01-06-00485-01A-01D                   2

It will mean that there is no data available for 10-18191T and Imaginary_sample, and these regions in HTMCP-01-06-00485-01A-01D are all diploid.

vladimirsouza commented 1 year ago

My interpretation is that 10-18191T exists and there are no copies of DNA in that region (deletions). Isn't this right? If 10-18191T exists, the function shouldn't return NA for it, should it?

Kdreval commented 1 year ago

The sample itself exists but the copy number data for it is not - because it is intentionally deleted from the workflows. So for this application, it is the same as Imaginary_sample - the NA should be returned. It is easy to check whether or not the data is available, in this example it is

get_sample_cn_segments(
    this_sample_id = my_meta$sample_id
) %>%
    pull(ID) %>%
    unique

The code you posted in the earlier comment regarding the missing_data_as_diploid parameter is good and can be implemented as is 👍

vladimirsouza commented 1 year ago

I completely understood. Thank you so much for the explanation. I'm going to make the Pull Request.

Kdreval commented 1 year ago

Thanks for looking at it!

vladimirsouza commented 1 year ago

The issue was fixed by PR #210. I'm going to close the issue.

morinlab / GAMBLR

`get_cn_states()` output is misleading for missing samples #207