Closed nathandunn closed 3 years ago
When doing bulk load calculations for BPA, we use handleMeanActivityData. This pulls the default gene sets and scores, then sorts and filters, all within the client. For BPA this is 8K+ gene sets, so doing this on the client side is pretty straightforward. Here we are simply pulling data from alternate data sources.
If we were going to further load ALL data into the same gene sets, there are a few ways to do it:
Method 1. I think we should reserve the buckets for individual data (expression vs CNV).
Method 2. An intermediate service, as in ucscXena/XenaGoWidget#645, should work, should scale, and should provide a simple means of adding additional data. Specifically, this is a PathwayAnalysisService.
Method 3. Using separate buckets for each cohort, we then have to filter by names and filter back out again, and this will continue to get slower as we add data. If we were querying on the server side this would work, but on the client it won't.
For some reason the R analysis stalls on VERY large loads. However, if I run the same loads independently, they take about 4 minutes each:
time Rscript ../analysis-wrapper.R 9606-bp-experimental.gmt_converted.tsv.gz TCGA-PRAD_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-PRAD.tsv
time Rscript ../analysis-wrapper.R 9606-bp-experimental.gmt_converted.tsv.gz TCGA-OV_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-OV.tsv
Both are 12K long gene sets. Not sure why this is timing out.
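One way to narrow down the stall is to run each cohort as its own process with a hard time limit and see which ones exceed it. This is a minimal sketch, assuming GNU coreutils `timeout` is available. The `COHORTS` list, the `LIMIT` value, and the `DRY_RUN` switch are illustrative and not part of the existing scripts. With `DRY_RUN=1` (the default here) it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: run the per-cohort analysis one cohort at a time with a time limit.
# COHORTS, LIMIT and DRY_RUN are assumptions made for illustration.
GMT="9606-bp-experimental.gmt_converted.tsv.gz"
COHORTS="PRAD OV"
LIMIT="${LIMIT:-600}"      # seconds; the independent runs took about 4 minutes each
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the command instead of running it

# Print the command line for one cohort, matching the manual runs above.
analysis_cmd() {
  printf 'Rscript ../analysis-wrapper.R %s TCGA-%s_tpm_tab.tsv %s-genesets-%s.tsv\n' \
    "$GMT" "$1" "$GMT" "$1"
}

# Run (or just print) the analysis for one cohort and report a time-out.
run_cohort() {
  local cmd status
  cmd="$(analysis_cmd "$1")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
    return 0
  fi
  # timeout exits with status 124 when the limit is hit.
  # $cmd is left unquoted on purpose so it splits into arguments.
  timeout "$LIMIT" $cmd
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "TIMEOUT after ${LIMIT}s: $1" >&2
  fi
  return "$status"
}

for cohort in $COHORTS; do
  run_cohort "$cohort"
done
```

Running it with `DRY_RUN=0 LIMIT=900` would actually execute each cohort and flag any that run past 15 minutes.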
We can test this with the web services:
time curl -v -F tpmdata=@test-data/TCGA-CHOL_logtpm_forTesting.tsv -F gmtdata=@test-data/Xena_manual_pathways.gmt http://localhost:8000/bpa_analysis
should be:
time curl -v -F tpmdata=@generate_all/TCGA-OV_tpm_tab.tsv -F gmtdata=@generate_all/9606-bp-experimental.gmt_converted.tsv http://localhost:8000/bpa_analysis
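To tell a slow response from a hung one, the same request can be wrapped with a time cap. This is a rough sketch: `post_bpa`, `BPA_URL`, `MAX_TIME`, and `DRY_RUN` are hypothetical names introduced here, and only the `tpmdata` and `gmtdata` form fields and the endpoint come from the commands above. With `DRY_RUN=1` (the default here) it only prints the request.

```shell
#!/usr/bin/env bash
# Sketch only: post one TPM file and one GMT file to the local service.
# post_bpa, BPA_URL, MAX_TIME and DRY_RUN are assumptions made for illustration.
BPA_URL="${BPA_URL:-http://localhost:8000/bpa_analysis}"
MAX_TIME="${MAX_TIME:-600}"   # seconds before curl gives up
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the request instead of sending it

post_bpa() {
  local tpm="$1" gmt="$2"
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -v --max-time $MAX_TIME -F tpmdata=@$tpm -F gmtdata=@$gmt $BPA_URL"
    return 0
  fi
  # curl exits with status 28 when --max-time is exceeded.
  curl -v --max-time "$MAX_TIME" -F "tpmdata=@$tpm" -F "gmtdata=@$gmt" "$BPA_URL"
}

post_bpa generate_all/TCGA-OV_tpm_tab.tsv generate_all/9606-bp-experimental.gmt_converted.tsv
```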
If we want to load 50K gene sets, we can't realistically do it on the client. We also have the problem of gene sets being different for different analysis methods.
Our options are: