Closed Kdreval closed 1 year ago
In this part of the code, the function intentionally replaces NAs with a copy number of 2 (diploid):
#fill in any sample/region combinations with missing data as diploid
meta_arranged = these_samples_metadata %>%
  dplyr::select(sample_id, pathology, lymphgen) %>%
  arrange(pathology, lymphgen)
eg = expand_grid(sample_id = pull(meta_arranged, sample_id), region_name = as.character(unique(seg_df$region_name)))
all_cn = left_join(eg, seg_df, by = c("sample_id" = "sample_id", "region_name" = "region_name")) %>%
  mutate(CN = replace_na(CN, 2))
I wonder whether this behavior has any useful application. If so, instead of just removing the line mutate(CN = replace_na(CN, 2)), would it be better to add a new boolean parameter missing_data_as_diploid, where TRUE means NAs are replaced with the diploid state (default FALSE)?
Then the new code would be:
meta_arranged = these_samples_metadata %>%
  dplyr::select(sample_id, pathology, lymphgen) %>%
  arrange(pathology, lymphgen)
eg = expand_grid(sample_id = pull(meta_arranged, sample_id), region_name = as.character(unique(seg_df$region_name)))
all_cn = left_join(eg, seg_df, by = c("sample_id" = "sample_id", "region_name" = "region_name"))
#fill in any sample/region combinations with missing data as diploid
if(missing_data_as_diploid){
  all_cn = mutate(all_cn, CN = replace_na(CN, 2))
}
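To make the effect of the proposed flag concrete, here is a minimal, self-contained sketch on toy data. The helper name fill_cn_grid and the toy tables are assumptions made up for illustration; only the body of the function follows the snippet above, and this is not necessarily how the change ends up looking inside get_cn_states.

```r
library(dplyr)
library(tidyr)

# Toy inputs, invented for illustration (not real project data)
toy_metadata <- tibble(
  sample_id = c("sample_A", "sample_B"),
  pathology = c("DLBCL", "FL"),
  lymphgen  = c("EZB", "Other")
)
# Only sample_A has copy number segments; sample_B has none
toy_seg_df <- tibble(
  sample_id   = c("sample_A", "sample_A"),
  region_name = c("region_1", "region_2"),
  CN          = c(1, 3)
)

# Hypothetical wrapper around the logic shown in the thread
fill_cn_grid <- function(these_samples_metadata, seg_df,
                         missing_data_as_diploid = FALSE) {
  meta_arranged <- these_samples_metadata %>%
    dplyr::select(sample_id, pathology, lymphgen) %>%
    arrange(pathology, lymphgen)
  # Every sample/region combination, whether or not data exists for it
  eg <- expand_grid(
    sample_id   = pull(meta_arranged, sample_id),
    region_name = as.character(unique(seg_df$region_name))
  )
  all_cn <- left_join(eg, seg_df, by = c("sample_id", "region_name"))
  # Legacy behavior is opt-in: fill missing combinations as diploid
  if (missing_data_as_diploid) {
    all_cn <- mutate(all_cn, CN = replace_na(CN, 2))
  }
  all_cn
}

with_na      <- fill_cn_grid(toy_metadata, toy_seg_df)
with_diploid <- fill_cn_grid(toy_metadata, toy_seg_df,
                             missing_data_as_diploid = TRUE)
```

With the default, the rows for sample_B keep NA in the CN column; with missing_data_as_diploid = TRUE they are filled with 2, which reproduces the legacy behavior.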
I think this is a good suggestion to have this configurable so we respect the legacy behavior but also have the flexibility to return NAs. Thanks!
As it stands, get_cn_states cannot differentiate samples that have no CN data from samples that don't exist: in both cases, the function generates NA values. Also, I think returning zero for the first case would be more informative than NA. Would it be worth making this change? For example, we could get outputs like this:
                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564
10-18191T                                 0                 0                 0                 0                   0
Imaginary_sample                         NA                NA                NA                NA                  NA
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2
                          1:12227060-12269285
10-18191T                                   0
Imaginary_sample                           NA
HTMCP-01-06-00485-01A-01D                   2
A 0 in this case would mean that there are zero copies of DNA in that region (or gene), so it would be interpreted as a deletion. Returning 2 for samples that have no CN changes just means that they are in a diploid state, and returning NA for samples that don't exist means that the information is not available. So the distinction is 2 versus NA, and the table in this small example would be:
                          1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564
10-18191T                                NA                NA                NA                NA                  NA
Imaginary_sample                         NA                NA                NA                NA                  NA
HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2
                          1:12227060-12269285
10-18191T                                  NA
Imaginary_sample                           NA
HTMCP-01-06-00485-01A-01D                   2
It will mean that there is no data available for 10-18191T and Imaginary_sample, and these regions in HTMCP-01-06-00485-01A-01D are all diploid.
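One way to picture the 2 versus NA distinction described here is the sketch below. It is only an illustration under stated assumptions: the toy table, the sample names, and the rule "fill with 2 only when the sample has at least one segment" are made up for this example and are not necessarily what get_cn_states does.

```r
library(dplyr)
library(tidyr)

# Toy segment table, invented for illustration
toy_seg_df <- tibble(
  sample_id   = c("has_data", "other"),
  region_name = c("region_1", "region_2"),
  CN          = c(1, 3)
)
# "no_data" stands in for a sample without any copy number data
samples <- c("has_data", "other", "no_data")

cn_long <- expand_grid(
  sample_id   = samples,
  region_name = unique(toy_seg_df$region_name)
) %>%
  left_join(toy_seg_df, by = c("sample_id", "region_name")) %>%
  # Assumed rule: a sample with some segment data is diploid where nothing
  # was reported; a sample with no data at all stays NA
  mutate(CN = if_else(is.na(CN) & sample_id %in% toy_seg_df$sample_id, 2, CN))

# One row per sample, one column per region, as in the tables above
cn_wide <- pivot_wider(cn_long, names_from = region_name, values_from = CN)
```

Here has_data and other get 2 in the region where they have no segment, while every region of no_data stays NA.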
My interpretation is that 10-18191T exists and there are no copies of DNA in that region (deletions). Isn't that right?
If 10-18191T exists, the function shouldn't return NA for it, should it?
The sample itself exists, but its copy number data does not, because it was intentionally removed from the workflows. So for this application it is the same as Imaginary_sample, and NA should be returned. It is easy to check whether the data is available; in this example:
get_sample_cn_segments(
  this_sample_id = my_meta$sample_id
) %>%
  pull(ID) %>%
  unique
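The same check can be turned around to list the samples that have no copy number data. The sketch below uses a toy data frame in place of the real output of get_sample_cn_segments; apart from the ID column used above, the column names and values are assumptions for illustration.

```r
# Toy metadata and a toy stand-in for the segment table (both invented)
my_meta <- data.frame(sample_id = c("sample_A", "sample_B", "sample_C"))
toy_segments <- data.frame(
  ID         = c("sample_A", "sample_A", "sample_C"),
  chrom      = c("1", "2", "1"),
  log.ratio  = c(0.0, -0.8, 0.5)
)

# Samples that appear at least once in the segment table
samples_with_cn <- unique(toy_segments$ID)

# Samples present in the metadata but absent from the segment table;
# these are the ones that should come back as NA
samples_without_cn <- setdiff(my_meta$sample_id, samples_with_cn)
```

In this toy case samples_without_cn contains only sample_B.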
The code you posted in the earlier comment regarding the missing_data_as_diploid parameter is good and can be implemented as is 👍
I understand completely now. Thank you so much for the explanation. I'm going to open the Pull Request.
Thanks for looking at it!
The issue was fixed by PR #210. I'm going to close the issue.
get_cn_states() has a misleading feature that needs to be dropped and deprecated. When a sample ID has no CNV data, get_cn_states() returns a matrix with neutral CN states instead of accurately reporting the missing values as NA. The minimal reproducible example that illustrates this problem:
This part has to be updated to keep the NA values instead of replacing them.