ucscXena / xena-geneset-analysis-service

node API service that runs and caches analysis results for genesets

push R code into API server to allow bulk upload #2

Closed nathandunn closed 3 years ago

nathandunn commented 3 years ago

If we want to load 50K genesets, we won't really be able to use the client. We also have the problem of genesets being different for different analysis methods.

Our options are:

  1. bulk upload into separate buckets (for each cohort) for the pathway data
  2. provide an intermediate service to store the available pathway scores (JSON blobs?)
  3. add a name field to the genesets, but we need to provide a fast way to query them

nathandunn commented 3 years ago

When doing bulk load calculations for BPA, we use handleMeanActivityData. This pulls the default gene sets and scores, then sorts and filters them, all within the client. For BPA this is 8K+ gene sets, so doing this on the client side is pretty straightforward. Here we are simply pulling data from alternate data sources.
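As a rough illustration of the client-side sort-and-filter step described above, here is a minimal sketch. The `GeneSetScore` shape, the `sortAndFilter` helper, and the threshold and limit parameters are assumptions made for illustration; the actual payload and logic of handleMeanActivityData are not shown in this thread.

```typescript
// Hypothetical shape of one gene set score; the real payload handled by
// handleMeanActivityData is not shown in this thread.
interface GeneSetScore {
  name: string;
  score: number; // e.g. a mean activity value for the gene set
}

// Keep gene sets whose absolute score meets a threshold, order them by
// descending absolute score, and cap the result at `limit` entries.
// filter() returns a new array, so the caller's input is not mutated.
function sortAndFilter(
  scores: GeneSetScore[],
  minAbsScore: number,
  limit: number
): GeneSetScore[] {
  return scores
    .filter((s) => Math.abs(s.score) >= minAbsScore)
    .sort((a, b) => Math.abs(b.score) - Math.abs(a.score))
    .slice(0, limit);
}

// Made-up example values, for illustration only.
const topGeneSets = sortAndFilter(
  [
    { name: "apoptosis", score: -2.1 },
    { name: "dna repair", score: 0.3 },
    { name: "cell cycle", score: 1.4 },
  ],
  1.0,
  10
);
```

With 8K+ gene sets this is a single pass plus a sort, which is why it stays cheap in the client.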

If we were going to further load ALL data into the same gene sets, the options look like this:

Method 1. I think we should reserve the buckets for individual data (expression vs CNV).

Method 2. An intermediate service, as in ucscXena/XenaGoWidget#645, should both work and scale, and it provides a simple means for adding additional data. Specifically, this is a PathwayAnalysisService.

Method 3. Using separate buckets for each cohort, we have to filter by name and then filter back out again, and this will keep getting slower as we add data. If we were querying on the server side, this would work, but on the client side it won't.
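To make Method 2 more concrete, here is a minimal in-memory sketch of what an intermediate store for pathway scores could look like. Everything in it is an assumption for illustration: the `AnalysisKey` fields, the `ScoreBlob` type, and the `PathwayAnalysisStore` class are hypothetical and are not the actual PathwayAnalysisService API.

```typescript
// Hypothetical key: one stored result per analysis method, cohort,
// and gene set collection.
interface AnalysisKey {
  method: string; // e.g. "BPA"
  cohort: string; // e.g. "TCGA-OV"
  geneSetName: string; // e.g. the name of the uploaded .gmt file
}

// Scores kept as an opaque JSON-style blob: gene set name -> per-sample scores.
type ScoreBlob = Record<string, number[]>;

class PathwayAnalysisStore {
  private results = new Map<string, ScoreBlob>();

  // Serialize the key fields in a fixed order so lookups are exact matches.
  private static keyOf(k: AnalysisKey): string {
    return JSON.stringify([k.method, k.cohort, k.geneSetName]);
  }

  save(k: AnalysisKey, blob: ScoreBlob): void {
    this.results.set(PathwayAnalysisStore.keyOf(k), blob);
  }

  find(k: AnalysisKey): ScoreBlob | undefined {
    return this.results.get(PathwayAnalysisStore.keyOf(k));
  }

  // List the cohorts that already have results for a method and gene set
  // collection, without loading any of the score blobs.
  listCohorts(method: string, geneSetName: string): string[] {
    const cohorts: string[] = [];
    this.results.forEach((_blob, key) => {
      const [m, cohort, g] = JSON.parse(key) as [string, string, string];
      if (m === method && g === geneSetName) cohorts.push(cohort);
    });
    return cohorts;
  }
}
```

Keying on method, cohort, and gene set collection together is one way to handle gene sets that differ between analysis methods; a real service would need persistent storage rather than a `Map`.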

nathandunn commented 3 years ago

For some reason the R analysis stalls on VERY large loads. However, if I run the same loads independently, they take about 4 minutes each:

time Rscript ../analysis-wrapper.R 9606-bp-experimental.gmt_converted.tsv.gz TCGA-PRAD_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-PRAD.tsv

time Rscript ../analysis-wrapper.R 9606-bp-experimental.gmt_converted.tsv.gz TCGA-OV_tpm_tab.tsv 9606-bp-experimental.gmt_converted.tsv.gz-genesets-OV.tsv

Both are 12K long gene sets. Not sure why this is timing out.
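Since the loads behave when run independently, one option is to generate one invocation per cohort and run them one at a time. The sketch below only builds the command lines, following the file naming visible in the two commands above; the `analysisArgs` helper is hypothetical, and it assumes every cohort's input file follows the same `TCGA-<cohort>_tpm_tab.tsv` pattern.

```typescript
// Build the Rscript argument list for one cohort. The naming pattern is
// inferred from the two commands above and may not hold for other cohorts.
function analysisArgs(gmtFile: string, cohort: string): string[] {
  return [
    "../analysis-wrapper.R",
    gmtFile, // gene set input
    `TCGA-${cohort}_tpm_tab.tsv`, // expression input
    `${gmtFile}-genesets-${cohort}.tsv`, // output file
  ];
}

const gmtFile = "9606-bp-experimental.gmt_converted.tsv.gz";

// One independent command line per cohort.
const commandLines = ["PRAD", "OV"].map(
  (cohort) => "time Rscript " + analysisArgs(gmtFile, cohort).join(" ")
);
```

Each entry in `commandLines` can then be run separately, so a stall in one cohort does not block the others.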

nathandunn commented 3 years ago

We can test this with the web services:

time curl -v -F tpmdata=@test-data/TCGA-CHOL_logtpm_forTesting.tsv -F gmtdata=@test-data/Xena_manual_pathways.gmt http://localhost:8000/bpa_analysis

should be:

time curl -v -F tpmdata=@generate_all/TCGA-OV_tpm_tab.tsv -F gmtdata=@generate_all/9606-bp-experimental.gmt_converted.tsv http://localhost:8000/bpa_analysis