push R code into API server to allow bulk upload

nathandunn commented 3 years ago

If we want to load 50K gene sets, we can't realistically do it through the client. We also have the problem that gene sets differ between analysis methods.

Our options are:

bulk upload into separate buckets (for each cohort) for the pathway data
provide an intermediate service to store the available pathway scores (JSON blobs?)
add a name field to the genesets, but we need to provide a fast way to query them

nathandunn commented 3 years ago

When doing bulk load calculations for BPA, we use handleMeanActivityData. This pulls the default gene sets and scores, then sorts and filters them, all within the client. For BPA this is 8K+ gene sets, so doing this client-side is pretty straightforward. Here we are simply pulling data from alternate data sources.

If we were going to go further and load ALL data into the same gene sets:

Method 1. I think we should reserve the buckets for individual data (expression vs CNV).

Method 2. An intermediate service as in ucscXena/XenaGoWidget#645 should work, should scale, and should provide a simple means of adding additional data. Specifically, this is a PathwayAnalysisService.

Method 3. With separate buckets for each cohort, we would have to filter by name and then filter back out again, and this will keep getting slower as we add data. If we were querying on the server side this would work, but from the client it won't.

nathandunn commented 3 years ago

For some reason the R analysis stalls on VERY large loads. However, if I run the same loads independently, they take about 4 minutes each:

time Rscript ../analysis-wrapper.R  9606-bp-experimental.gmt_converted.tsv.gz  TCGA-PRAD_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-PRAD.tsv

time Rscript ../analysis-wrapper.R  9606-bp-experimental.gmt_converted.tsv.gz  TCGA-OV_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-OV.tsv

Both are 12K long gene sets. Not sure why this is timing out.
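Since the independent runs above finish in about 4 minutes each, one possible workaround is to drive the wrapper one cohort at a time and record how long each run takes. This is only a sketch: the `run_one` and `run_all` helpers and the `DRY_RUN` switch are hypothetical names, and it assumes every cohort follows the same `TCGA-<cohort>_tpm_tab.tsv` naming as the two commands above.

```shell
#!/bin/sh
# Sketch: run analysis-wrapper.R once per cohort instead of as one large load.
# Assumed (not from the thread): helper names, DRY_RUN, and that all cohorts
# share the file naming used for PRAD and OV.

GMT="9606-bp-experimental.gmt_converted.tsv.gz"

# Run (or, with DRY_RUN=1, only print) the analysis for a single cohort.
run_one() {
  cohort="$1"
  set -- Rscript ../analysis-wrapper.R "$GMT" \
    "TCGA-${cohort}_tpm_tab.tsv" "${GMT}-genesets-${cohort}.tsv"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
    return 0
  fi
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  # Report wall-clock time per cohort so a stalled run is easy to spot.
  echo "${cohort}: exit ${status} after $((end - start))s" >&2
  return "$status"
}

# Run each cohort given on the command line in sequence.
run_all() {
  for c in "$@"; do
    run_one "$c" || echo "FAILED: $c" >&2
  done
}

# Example: run_all PRAD OV
```

Setting `DRY_RUN=1` prints the commands without executing them, which is a cheap way to check the generated file names before starting a long batch.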

nathandunn commented 3 years ago

We can test this with the web services:

time curl -v -F tpmdata=@test-data/TCGA-CHOL_logtpm_forTesting.tsv -F gmtdata=@test-data/Xena_manual_pathways.gmt http://localhost:8000/bpa_analysis

should be:

time curl -v -F tpmdata=@generate_all/TCGA-OV_tpm_tab.tsv -F gmtdata=@generate_all/9606-bp-experimental.gmt_converted.tsv http://localhost:8000/bpa_analysis
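To tell a stall from a slow but successful run, the same request can be wrapped with a hard time limit and a status-code printout. The sketch below uses only standard curl options (`--max-time`, `-o`, `-w`, `-F`); the `bpa_request` helper, the `MAX_TIME`, `BPA_URL` and `DRY_RUN` variables, and the 600-second default are assumptions for illustration, not part of the service.

```shell
#!/bin/sh
# Sketch: POST a TPM file and a gene set file to /bpa_analysis with a timeout.
# Assumed (not from the thread): helper name, variables, and default limit.

bpa_request() {
  tpm="$1"   # expression file, e.g. generate_all/TCGA-OV_tpm_tab.tsv
  gmt="$2"   # gene set file
  out="$3"   # where to save the response body
  set -- curl -sS --max-time "${MAX_TIME:-600}" -o "$out" -w '%{http_code}' \
    -F "tpmdata=@${tpm}" -F "gmtdata=@${gmt}" \
    "${BPA_URL:-http://localhost:8000/bpa_analysis}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "$*"
    return 0
  fi
  # curl exits with status 28 when --max-time is exceeded.
  "$@"
  status=$?
  echo >&2
  return "$status"
}

# Example:
# bpa_request generate_all/TCGA-OV_tpm_tab.tsv \
#   generate_all/9606-bp-experimental.gmt_converted.tsv result.json
```

A non-zero return with curl exit status 28 would point at the request exceeding the limit rather than at an error response from the service.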

ucscXena / xena-geneset-analysis-service
