Closed CarlinLiao closed 1 year ago
As a general pattern, the only target y-variable that is reliably available across datasets is the cohort assignment (painstakingly ascertained from all the metadata). Let us focus on that one. (Note, however, that there is a vague idea in the works to make it possible to add session-local outcome-variable data client-side in the web application.)
I suggest storing the top 1000 cells per sample, in order of importance score. We may not need to store the scores themselves, just the priority order, and we can do so directly in a new database table. The schema could look like:
| data_analysis_study | histological_structure | importance_order |
|---|---|---|
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 1337 | 1 |
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 2309 | 2 |
| ... | ... | ... |
| dataset1 ... GNN importance for sample cohort, mm-dd-yyyy ... | 101 | 1000 |
| dataset2 ... GNN importance for sample cohort, mm-dd-yyyy ... | 2001 | 1 |
| ... | ... | ... |
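To make the proposal concrete, here is a minimal sketch of the table using SQLite for illustration. The table and column names follow the example above, but this is not the actual ADI/scstudies schema, and the composite primary key is an assumption.

```python
import sqlite3

# Illustrative only: column names follow the proposal above, not the real schema.
connection = sqlite3.connect(':memory:')
connection.execute("""
    CREATE TABLE cell_importance (
        data_analysis_study TEXT NOT NULL,
        histological_structure INTEGER NOT NULL,
        importance_order INTEGER NOT NULL,
        PRIMARY KEY (data_analysis_study, histological_structure)
    )
""")
rows = [
    ('dataset1 GNN importance', 1337, 1),
    ('dataset1 GNN importance', 2309, 2),
    ('dataset2 GNN importance', 2001, 1),
]
connection.executemany('INSERT INTO cell_importance VALUES (?, ?, ?)', rows)

# Retrieve one study's cells in priority order.
top = connection.execute(
    'SELECT histological_structure FROM cell_importance '
    'WHERE data_analysis_study = ? ORDER BY importance_order',
    ('dataset1 GNN importance',),
).fetchall()
print(top)  # [(1337,), (2309,)]
```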
It would then be nice to have a function, provided by the db module, that reports, for a given importance-priority threshold, the fraction of the cells passing that threshold as a function of phenotype (either single-channel or pre-defined composite) and of sample. (With at most 1000 stored per slide, this should be more or less instantaneous to calculate.)
It is reasonable to have a separate function which does the same calculation but for one specific provided combination/composite phenotype (whether or not it is one of the pre-defined combinations).
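The fraction calculation described above can be sketched in plain Python. The function name and the in-memory inputs (`ranked_cells`, `phenotype_of`) are hypothetical stand-ins for db-module queries, not the actual API.

```python
from collections import defaultdict

def phenotype_fractions(ranked_cells, phenotype_of, threshold):
    """For each sample, the fraction of its top-`threshold` cells
    carrying each phenotype.

    `ranked_cells` maps sample -> cell IDs in importance order;
    `phenotype_of` maps cell ID -> phenotype label. Both are
    hypothetical inputs standing in for database queries.
    """
    fractions = {}
    for sample, cells in ranked_cells.items():
        selected = cells[:threshold]
        counts = defaultdict(int)
        for cell in selected:
            counts[phenotype_of[cell]] += 1
        fractions[sample] = {
            phenotype: count / len(selected)
            for phenotype, count in counts.items()
        }
    return fractions

ranked = {'sample1': [1337, 2309, 101, 55]}
phenotypes = {1337: 'B cell', 2309: 'T cell', 101: 'B cell', 55: 'T cell'}
print(phenotype_fractions(ranked, phenotypes, 2))
# {'sample1': {'B cell': 0.5, 'T cell': 0.5}}
```

A variant for one specific composite phenotype would just replace the `phenotype_of` lookup with a membership test against that composite's signature.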
We'll add a function to the `cggnn` module to import importance scores from the CSV output of the `cg-gnn` pip library into memory, process them into the ordinal format we want to use for the API server, and then pass them along to the db module.
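The score-to-ordinal conversion can be sketched as follows. The CSV column names (`sample`, `histological_structure`, `importance`) are assumptions for illustration; the actual cg-gnn output format may differ.

```python
import csv
import io

def importance_to_ordinals(csv_text, top_k=1000):
    """Convert CSV rows of (sample, cell, importance score) into
    per-sample rank order, keeping only the top_k cells per sample.
    Column names are hypothetical, not the confirmed cg-gnn format.
    """
    by_sample = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_sample.setdefault(row['sample'], []).append(
            (float(row['importance']), int(row['histological_structure'])))
    ordinals = {}
    for sample, scored in by_sample.items():
        scored.sort(reverse=True)  # highest importance first
        ordinals[sample] = [cell for _, cell in scored[:top_k]]
    return ordinals

example = """sample,histological_structure,importance
s1,1337,0.9
s1,2309,0.7
s1,101,0.95
"""
print(importance_to_ordinals(example))
# {'s1': [101, 1337, 2309]}
```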
We'll add a function to the db module that creates the new table outlined above.
I take it that, since we have the `histological_structure`, we don't need to specify the study they came from?
That is correct! There are actually two ways to recover the study: from the cell identifier ("histological structure") and from the data analysis study. The latter way is more direct.
(For the above to be true, one needs to create a new data analysis study at the time of importing all this cggnn stuff into the db.)
A thought occurs: interviewers love it when an interviewee has deployed deep learning models in production. While we shouldn't have this hold up SPT publication, it would be nice to look into deploying the GNN on AWS at some point.
That's on the agenda. We will do it in the future with the clinical single-sample SPT deployment with input to GNNs coming from the DeepLIIF/ImPartial output.
I've created a PR in ADI schemas to add the importance order table as defined by Jimmy, but after sketching out the upload process in the mold of `db/fractions_transcriber.py`, it occurs to me: should we be creating a separate importances table, or just uploading the cell importance orders to the `quantitative_feature_value`, `feature_specification`, and `feature_specifier` tables?
Yes, somehow when writing my comments above I forgot that there is already a place for derived feature values just like this... I guess because so far that system has been used for sample-level features, but there is nothing restricting the feature subject to being a slide/sample. We can use the histological structure ID (cell) instead.
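Reusing the generic derived-feature tables might look like the following. The row shape and field names here are illustrative assumptions, not the actual `quantitative_feature_value` schema.

```python
# Hypothetical sketch of mapping importance ranks onto generic
# derived-feature rows. Field names are illustrative, not the
# actual scstudies schema.
def feature_value_rows(specification_id, ordered_cells):
    """One quantitative-feature-value-style row per cell: the subject
    is the histological structure ID and the value is the rank."""
    return [
        {
            'feature_specification': specification_id,
            'subject': cell,
            'value': rank,
        }
        for rank, cell in enumerate(ordered_cells, start=1)
    ]

rows = feature_value_rows('spec-42', [1337, 2309, 101])
print(rows[0])
# {'feature_specification': 'spec-42', 'subject': 1337, 'value': 1}
```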
I might actually also suggest the mold of `ondemand/proximity.py` rather than `fractions_transcriber.py`, because it also includes a lightweight "task queue" that allows long-running computation to go to the background. (This is not strictly necessary here, since the per-cell importance rank is computed only once, after the gnn workflow; the question of "ondemand" retrieval is separate.) The task queue implementation there is nearly generalized to any computed feature, so it could be factored out and re-used.
This would also save us the trouble of updating the `scstudies` schema with something specific to this workflow.
In that case you can reject the PR for ADI schemas, and I'll push the changes I made using the fractions pattern in SPT.
Per in-person discussion, we'll be implementing cg-gnn functionality in the apiserver as follows:
If a user wants to use the full feature suite of the `cggnn` module, they are advised to install the SPT or cg-gnn packages locally and go from there.
Open items
Implementation depends on #162 (the API refactor) landing before a new feature can be added to the API.