matthewhirschey / ddh.org

datadrivenhypothesis.org is a resource to query 100+ GB of raw biological science data to develop data-driven hypotheses
2 stars 7 forks source link

Validation branch to remove bias in na_cutoff and quantify effects of special sauce #71

Closed matthewhirschey closed 4 years ago

matthewhirschey commented 4 years ago

This PR has 3 components.

  1. I've added validation.Rmd and changed depmap_explore.Rmd, both in the EDA folder, that is "validation" code to help us identify hits to validate in the lab. Hence the name of this branch. Nothing to review here as these do not change the behavior of the app.

  2. I've added a new script called find_threshold.R. The idea here is to remove bias in identifying the na_cutoff value, which is hard coded as a variable in current_release.R. This code should take the URLs in current_release.R, download the data, and then generate a correlation matrix on the raw data, remove zero-expression values, and then calculates the relationship between number of cells lines the gene gets its correlation value from vs. the number of genes that are above the resampled mean +3SD. Then it removes the top 5%, with the working assumption that these are false-positives (5% FDR is a reasonable statistical approach), which then finally spits out a value of the number of cell lines where this 5% line crosses.

This is going to change my workflow, slightly. The new quarterly update process will begin with getting the URLs of the new datasets entering them in current_release.R, running find_threshold.R (takes ~1hr to run; importantly, the na_cutoff value from current_release is not called in find_threshold.R), and then entering the printed value in current_release.R for generate_x.R scripts to use.

Candidly, I'd like to be a little more aggressive in the cutoff, as I think there is still a bunch of garbage in there, but this will be a good start.

  1. I've updated the methods to reflect these changes, and to add new graphs that quantifies some of these changes.

fixes issue #69

matthewhirschey commented 4 years ago

One note @johnbradley, I've added on package called patchwork to the methods document used to place two graphs next to eachother. This is never called in the app, so not sure if you want to add it to the master container build or not. So far, I've been making the methods documents locally, and serving them as markdown and image files.

matthewhirschey commented 4 years ago

What would be useful @johnbradley would be to:

johnbradley commented 4 years ago

@matthewhirschey I ran find_threadshold.R and got 631 which is the value already in current_release.R. I'm running the generate_*.R scripts now.

If find_threadshold.R is the definitive source for the na_cutoff value should find_threadshold.R save na_cutoff to a file instead of printing? This would avoid the possibility of forgetting or mistyping the na_cutoff value.

matthewhirschey commented 4 years ago

That’s a great suggested change. I thought about the same chicken/egg problem, but hadn’t considered the problem that it might fail if no file exists. At first I did not specifically name the file with the current release prefix, so there would always be a file there. But then when I changed it to record the release by appending 20Q1, I see the failure with 20Q2.

On Mar 9, 2020, at 7:56 AM, John Bradley notifications@github.com wrote:



@johnbradley commented on this pull request.


In code/current_release.Rhttps://urldefense.com/v3/__https://github.com/matthewhirschey/ddh/pull/71*discussion_r389614231__;Iw!!OToaGQ!8j6wr4OB-pqkAw_sQtU0CdcTfK18A6H-_Nra8E-wCjs0nkgVmzkVwLN01B5BqPSjy-REhBU$:

\ No newline at end of file +na_cutoff <- readRDS(file = here::here("data", paste0(release, "_na_cutoff.Rds"))) #~5% FDR

I think you have a chicken and egg problem here. This file is sourced by find_threadshold.R which will create the 20Q2_na_cutoff.Rds file but fails with No such file or directory. One possible solution is to only set na_cutoff if the file exists:

⬇️ Suggested change

-na_cutoff <- readRDS(file = here::here("data", paste0(release, "_na_cutoff.Rds"))) #~5% FDR +na_cutoff_file = here::here("data", paste0(release, "_na_cutoff.Rds")) +if (file.exists(na_cutoff_file)) {

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/matthewhirschey/ddh/pull/71?email_source=notifications&email_token=AIZYT7NKU3ZZLNBVCHKGY2DRGTROJA5CNFSM4LDPNQU2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCYPJLTY*pullrequestreview-371103183__;Iw!!OToaGQ!8j6wr4OB-pqkAw_sQtU0CdcTfK18A6H-_Nra8E-wCjs0nkgVmzkVwLN01B5BqPSjD4D3LbU$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AIZYT7JP3JB2QM2Y2QIFZX3RGTROJANCNFSM4LDPNQUQ__;!!OToaGQ!8j6wr4OB-pqkAw_sQtU0CdcTfK18A6H-_Nra8E-wCjs0nkgVmzkVwLN01B5BqPSjIzRq1JQ$.