Closed CarlinLiao closed 1 year ago
As a general pattern, the only target y-variable that is reliably available across datasets is the cohort assignment (painstakingly ascertained from all the metadata). Let us focus on that one. (Note, however, that there is a vague idea in the works to make it possible to add session-local outcome-variable data client-side in the web application.)
I suggest storing the top 1000 cells per sample, in order of importance score. We may not need to store the scores themselves, just the priority order, and we can do so directly in a new database table. The schema could look like:
| data_analysis_study | histological_structure | importance_order |
|---|---|---|
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 1337 | 1 |
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 2309 | 2 |
| ... | ... | ... |
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 101 | 1000 |
| dataset2 ... GNN importance for sample cohort, mm-dd-yyyy ... | 2001 | 1 |
| ... | ... | ... |
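To make the proposal concrete, here is a minimal sketch of the table using SQLite for illustration. The table and column names follow the example above, but this is not the actual ADI/scstudies schema, and the composite primary key is an assumption.

```python
import sqlite3

# Illustrative only: column names follow the proposal above, not the real schema.
connection = sqlite3.connect(':memory:')
connection.execute("""
    CREATE TABLE cell_importance (
        data_analysis_study TEXT NOT NULL,
        histological_structure INTEGER NOT NULL,
        importance_order INTEGER NOT NULL,
        PRIMARY KEY (data_analysis_study, histological_structure)
    )
""")
rows = [
    ('dataset1 GNN importance', 1337, 1),
    ('dataset1 GNN importance', 2309, 2),
    ('dataset2 GNN importance', 2001, 1),
]
connection.executemany('INSERT INTO cell_importance VALUES (?, ?, ?)', rows)

# Retrieve one study's cells in priority order.
top = connection.execute(
    'SELECT histological_structure FROM cell_importance '
    'WHERE data_analysis_study = ? ORDER BY importance_order',
    ('dataset1 GNN importance',),
).fetchall()
print(top)  # [(1337,), (2309,)]
```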
It would then be nice to have a function, provided by the db module, that reports, for a given importance-priority threshold, the fraction of the cells passing that threshold as a function of phenotype (either single-channel or pre-defined composite) and of sample. (With at most 1000 stored per slide, this should be more or less instantaneous to calculate.)
It is reasonable to have a separate function which does the same calculation but for one specific provided combination/composite phenotype (whether or not it is one of the pre-defined combinations).
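The fraction calculation described above can be sketched in plain Python. The function name and the in-memory inputs (`ranked_cells`, `phenotype_of`) are hypothetical stand-ins for db-module queries, not the actual API.

```python
from collections import defaultdict

def phenotype_fractions(ranked_cells, phenotype_of, threshold):
    """For each sample, the fraction of its top-`threshold` cells
    carrying each phenotype.

    `ranked_cells` maps sample -> cell IDs in importance order;
    `phenotype_of` maps cell ID -> phenotype label. Both are
    hypothetical inputs standing in for database queries.
    """
    fractions = {}
    for sample, cells in ranked_cells.items():
        selected = cells[:threshold]
        counts = defaultdict(int)
        for cell in selected:
            counts[phenotype_of[cell]] += 1
        fractions[sample] = {
            phenotype: count / len(selected)
            for phenotype, count in counts.items()
        }
    return fractions

ranked = {'sample1': [1337, 2309, 101, 55]}
phenotypes = {1337: 'B cell', 2309: 'T cell', 101: 'B cell', 55: 'T cell'}
print(phenotype_fractions(ranked, phenotypes, 2))
# {'sample1': {'B cell': 0.5, 'T cell': 0.5}}
```

A variant for one specific composite phenotype would just replace the `phenotype_of` lookup with a membership test against that composite's signature.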
We'll add a function to the `cggnn` module to import importance scores from the CSV output of the `cg-gnn` pip library into memory, process them into the ordinal format we want to use for the API server, and then pass them along to the db module.
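The score-to-ordinal conversion can be sketched as follows. The CSV column names (`sample`, `histological_structure`, `importance`) are assumptions for illustration; the actual cg-gnn output format may differ.

```python
import csv
import io

def importance_to_ordinals(csv_text, top_k=1000):
    """Convert CSV rows of (sample, cell, importance score) into
    per-sample rank order, keeping only the top_k cells per sample.
    Column names are hypothetical, not the confirmed cg-gnn format.
    """
    by_sample = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_sample.setdefault(row['sample'], []).append(
            (float(row['importance']), int(row['histological_structure'])))
    ordinals = {}
    for sample, scored in by_sample.items():
        scored.sort(reverse=True)  # highest importance first
        ordinals[sample] = [cell for _, cell in scored[:top_k]]
    return ordinals

example = """sample,histological_structure,importance
s1,1337,0.9
s1,2309,0.7
s1,101,0.95
"""
print(importance_to_ordinals(example))
# {'s1': [101, 1337, 2309]}
```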
We'll add a function to the db module that creates the new table outlined above.
I take it that, since we have the `histological_structure`, we don't need to specify the study they came from?
That is correct! There are actually two ways to recover the study: from the cell identifier ("histological structure") and from the data analysis study. The latter way is more direct.
(For the above to be true, one needs to create a new data analysis study at the time of importing all this cggnn stuff into the db.)
A thought occurs: interviewers love it when an interviewee has deployed deep learning models in production. While we shouldn't have this hold up SPT publication, it would be nice to look into deploying the GNN on AWS at some point.
That's on the agenda. We will do it in the future with the clinical single-sample SPT deployment with input to GNNs coming from the DeepLIIF/ImPartial output.
I've created a PR in ADI schemas to add the importance order table as defined by Jimmy, but after sketching out the upload process in the mold of `db/fractions_transcriber.py`, it occurs to me: should we be creating a separate importances table, or just uploading the cell importance orders to the `quantitative_feature_value`, `feature_specification`, and `feature_specifier` tables?
Yes, somehow when writing my comments above I forgot that there is already a place for derived feature values just like this... I guess because so far that system has been used for sample-level features, but there is nothing restricting the feature subject to being a slide/sample. We can use the histological structure ID (cell) instead.
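Reusing the generic derived-feature tables might look like the following. The row shape and field names here are illustrative assumptions, not the actual `quantitative_feature_value` schema.

```python
# Hypothetical sketch of mapping importance ranks onto generic
# derived-feature rows. Field names are illustrative, not the
# actual scstudies schema.
def feature_value_rows(specification_id, ordered_cells):
    """One quantitative-feature-value-style row per cell: the subject
    is the histological structure ID and the value is the rank."""
    return [
        {
            'feature_specification': specification_id,
            'subject': cell,
            'value': rank,
        }
        for rank, cell in enumerate(ordered_cells, start=1)
    ]

rows = feature_value_rows('spec-42', [1337, 2309, 101])
print(rows[0])
# {'feature_specification': 'spec-42', 'subject': 1337, 'value': 1}
```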
I might actually also suggest the mold of `ondemand/proximity.py` rather than `fractions_transcriber.py`, because it also includes a lightweight "task queue" that allows long-running computation to go to the background. (This is not strictly necessary here, since the per-cell importance rank is computed only once, after the gnn workflow; the question of "ondemand" retrieval is separate.) The task queue implementation there is nearly generalized to any computed feature, so it could be factored out and re-used.
This would also save us the trouble of updating the `scstudies` schema with something specific to this workflow.
In that case you can reject the PR for ADI schemas, and I'll push the changes I made using the fractions pattern in SPT.
Per in-person discussion, we'll be implementing cg-gnn functionality in the apiserver as follows:
If a user wants to use the full feature suite of the `cggnn` module, they are advised to install the SPT or cg-gnn packages locally and go from there.
Open items
Implementation depends on #162 (the API refactor) landing before a new feature can be added to the API.