matthewhirschey / ddh.org

datadrivenhypothesis.org is a resource to query 100+ GB of raw biological science data to develop data-driven hypotheses

Pubmed #72

Closed · matthewhirschey closed this 4 years ago

matthewhirschey commented 4 years ago

This is a major update, and now includes pubmed data. Key changes:

  1. update_gene_summary.R pulls data from two sources to add information to the gene landing page. (NB: these data are not yet displayed, given the reorganization of this page in the pending pathways update.)

  2. generate_pubmed_data.R generates a large data file of the co-publication relationships between all genes. To avoid gaps when matching gene names (which would crash the app), it generates dummy data for 17 genes, which yields all-zero values for those genes only. (A sketch of this padding follows this list.)

  3. The large (2 GB) pubmed dataset is read in generate_depmap_tables.R to add a few columns to these tables, but importantly it is not loaded in the shiny app, to keep performance/responsiveness up.

  4. Added the ability for users to select which columns are displayed in the dependency tables; the tables are getting busy, so this keeps them lean. (Also sketched after this list.)

  5. Updated the methods to include this new information, along with some aesthetic changes, which means I am now rendering methods.Rmd straight to methods.html (not via the methods.md intermediate). Importantly, the html that is generated is only a "fragment" containing just the page content. This is so that it can be loaded in shiny, which supplies the surrounding page structure (head, styles, etc.). (Third sketch below.)

  6. The methods update now uses some of the functions from app.R. So I pulled the common functions out of app.R, put them in fun_*.R scripts, and load them in both app.R and methods.Rmd. Those scripts are now the single home for these functions, which means I only need to change them in one spot. The shiny-specific functions are still called only in the shiny app. I needed to add some more variables to these functions (specifically, passing in the dataset) so they would be available in each function's environment. The app is working for me. (The last sketch below shows the sourcing setup.)
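
A minimal sketch of the dummy-data padding from item 2, assuming dplyr; the object and column names here are invented, not the script's actual ones:

```r
library(dplyr)

# Genes present in the master symbol list but absent from the pubmed data;
# per this PR there are 17 of these (gene_summary/pubmed_pairs names are assumed).
missing_genes <- setdiff(gene_summary$approved_symbol, pubmed_pairs$gene)

# Pad with all-zero rows so gene-name lookups in the app never come back empty.
dummy_rows <- tibble(gene = missing_genes, copublication_count = 0)
pubmed_pairs <- bind_rows(pubmed_pairs, dummy_rows)
```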
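
A self-contained Shiny sketch of the user-selectable columns from item 4; the table, input, and column names are illustrative, not the app's actual ones:

```r
library(shiny)
library(DT)

# Stand-in dependency table; the real ones come from generate_depmap_tables.R.
dep_table <- data.frame(gene = c("TP53", "BRCA1", "SDHA"),
                        r2 = c(0.81, 0.64, 0.52),
                        copublications = c(120, 95, 12))

ui <- fluidPage(
  checkboxGroupInput("cols", "Columns to display",
                     choices = names(dep_table),
                     selected = names(dep_table)),
  DTOutput("table")
)

server <- function(input, output) {
  output$table <- renderDT({
    req(input$cols)                        # wait until at least one column is picked
    dep_table[, input$cols, drop = FALSE]  # show only the selected columns
  })
}

shinyApp(ui, server)
```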
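
A sketch of the fragment rendering from item 5; methods.Rmd/methods.html come from this PR, while the code/ paths and render options are my assumptions:

```r
library(rmarkdown)

# Render methods.Rmd straight to a bare HTML fragment: no <html>/<head> wrapper,
# just the content, so shiny can drop it into an already-built page.
render("code/methods.Rmd",
       output_format = html_fragment(),
       output_file = "methods.html")

# In the shiny UI, the fragment would then be pulled in with something like:
# includeHTML("code/methods.html")
```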
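
And the shared-function setup from item 6 presumably reduces to a couple of source() calls; the fun_*.R file names below are placeholders:

```r
# Near the top of app.R, and again in the setup chunk of methods.Rmd
# (file names assumed; the PR only says they match fun_*.R):
source("code/fun_plots.R")   # shared plotting helpers
source("code/fun_tables.R")  # shared table helpers
```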

Took all week, but LGTM. If all of this looks good to you, then we can update 20Q1 data to reflect this.

johnbradley commented 4 years ago

I am having trouble running generate_pubmed_data.R. I tried to run it locally, but my laptop got sluggish and I killed it (it had reached about 30 GB of memory used). It seems to take a really large amount of memory: it also failed, needing more than 100 GB, when running on our slurm cluster.

matthewhirschey commented 4 years ago

Hm. I got a new laptop with 64 GB of memory, upped RStudio's allocation to 1000 GB (!), and it ran fine for me... it only took about 50 minutes, but I don't know how much memory it actually needs.

The step that takes the most memory begins at line 30 of the script. If you want to test the code, you could insert this at line 31:

```r
slice(1:1000000) %>% # keep 1M of the 21M rows
```

Also, here is a link to the 2 GB output file: https://www.dropbox.com/s/m0oxpnnuqq3yyxt/20Q1_pubmed_concept_pairs.Rds?dl=0 This is the output file from the script, and it is needed for generate_tables.R.

Thoughts about how to overcome this? Can you increase the memory allocation on the cluster?

johnbradley commented 4 years ago

> Thoughts about how to overcome this? Can you increase the memory allocation on the cluster?

We have some larger 500 GB memory nodes on the cluster, but it may take a little while to get the job scheduled since they are quite popular. I'll try running it there. There may also be some R configuration to make it use less memory and more disk for storing the intermediate data; I'll look into that. One possibility is sketched below.
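
One disk-backed possibility, sketched with the arrow package; this is purely an assumption about what such a configuration could look like (file path and column name invented), not something tested in this thread:

```r
library(arrow)
library(dplyr)

# open_dataset() scans the file lazily from disk instead of reading it whole.
pairs <- open_dataset("data/pubtator_pairs.csv", format = "csv")

# The aggregation runs over the data in chunks; only the small summarised
# result is materialised in memory when collect() is called.
pair_counts <- pairs %>%
  group_by(gene) %>%
  summarise(n = n()) %>%
  collect()
```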

johnbradley commented 4 years ago

I was finally able to run the build steps all the way through. The generate_pubmed_data.R step took around 30 minutes and used 160 GB of memory. The website looks good and everything seems to be working after I upgraded ggplot2. 👍

@matthewhirschey Could you merge update_gene_summary.R into create_gene_summary.R? I think you can just add the guts of update_gene_summary.R in between these two lines: https://github.com/matthewhirschey/ddh/blob/16b744b7e1089a74a783006fbc3c10ba1c7198e3/code/create_gene_summary.R#L63-L64 Perhaps wrap the guts of update_gene_summary.R up in a function.

I still need to update the Makefile and the slurm job to handle the larger memory requirement for the pubmed generation step (see the sketch below). I would like to do that in a separate PR.
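
For the slurm side, the change would presumably be raising the job's memory request, along the lines of the directive below (the exact value is a guess based on the 160 GB observed above):

```
#SBATCH --mem=200G
```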

matthewhirschey commented 4 years ago

The reason I wrote update_gene_summary.R separately, rather than putting it in the create file, was that I thought we wouldn't generate the data too often, but would instead update it from time to time. But I suppose this would be easy to wrap together, and it'd probably be on a yearly update cycle, or something around that time frame.

I'll work on that, and then push to the pubmed branch (here) for your review.

matthewhirschey commented 4 years ago

I added an update_gene_summary function to the create_gene_summary.R script. I cannot test the last function, which calls both intermediate functions. But I tested the update step with gene_summary <- update_gene_summary(gene_summary, gene2pubmed_url, pubtator_url), and it takes the 11-variable input df and returns a 15-variable gene_summary df. A sketch of the resulting structure follows.
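
For reference, a sketch of the merged structure; the signature comes from the test call above, and the body is elided because it lives in the actual script:

```r
# In create_gene_summary.R (body elided; see the script for the real logic):
update_gene_summary <- function(gene_summary, gene2pubmed_url, pubtator_url) {
  # ...pull gene2pubmed and PubTator data, join the four new columns...
  gene_summary
}

# The tested call: an 11-variable df in, a 15-variable df out.
gene_summary <- update_gene_summary(gene_summary, gene2pubmed_url, pubtator_url)
```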

@johnbradley Can you test the last function without running the entire dataset again? Alternatively, feel free to let the whole thing loose, and see if we can generate the whole dataset from scratch.

If it all looks OK, then you can delete update_gene_summary.R.

johnbradley commented 4 years ago

@matthewhirschey I ran create_gene_summary.R and it finished fine with the latest changes.