GNNs - Githubissues

jimmymathews commented 1 year ago

This is graph neural networks modeling and statistics, to be implemented as a new computational-visitor-patterned workflow in SPT. It will start from any dataset available in an ADI-structured database, perform training / model fitting on each available outcome, and send computed features into the database.

May be convenient to complete #83 first as a warmup.

Acceptance

A new workflow is added.
The workflow directory contains a module test and at least one unit test.
Pass the test suite.
The trained model gets saved somewhere with some minimal amount of metadata around it describing the circumstances of its birth.
Tried on a major exemplar dataset, results inspected.

CarlinLiao commented 1 year ago

I'm maintaining the GNN functionality in an independent repository (on the MSK GitHub, but it doesn't have to be there) that I'm continuing to develop.

We want to merge those features into a sub-sub-directory in the SPT workflows module, ideally while still allowing for pulling additional changes from the original GNN repo, and to update all your docker dependencies etc. to support it. This comes with two main issues

Git supports merging two independent histories together, but not so much merging one repo into a subdirectory of another repo. There does exist a tool to support this, but it'd create a weird and niche development dependency.
The GNN workflow's dependencies are a bit different than the rest of SPT's and I'm not sure how far we can go to isolate them. For example, it requires Python 3.9 instead of 3.10 and would like (but doesn't require) CUDA GPU support.

To do the merge, I have to mess with the directory structure in the original GNN repo, in a way that's less compatible with its independent usage, not to mention the need to add files like core, initializer, and integrator to follow the SPT workflow.

We could just wait until I'm done building out the GNN module before we work on merging it in, but that doesn't allow for parallel testing, and the feature is mostly ready to integrate anyway, barring a few additional metric analysis and visualization add-ons.

(Speaking of, I'm not sure what the intent is to do with the GNN feature. Do we really just want the trained model, or do we also want to output some nice graphs and scoring? Identify more important features? I was under the impression that we weren't really concerned about the actual trained model at all, tbh.)

CarlinLiao commented 1 year ago

Perhaps we should reiterate and clarify the scope of the GNN feature:

Train a model on graphs created from histological structures (i.e., cells only and not tissue), with multiple graphs (ROIs) per slide, against a slide classification of the user's choice
Using the trained model, identify cell importance and visualize patterns in feature activation across each cell graph for qualitative explainability purposes. (How should these visualizations be output? Images, interactive HTML files?)
Possibly try incorporating some quantitative metrics from prior research as well?
Infer new entries? Are we planning on using the model again on new data?

The more detail we can settle on, the more I can strip down the module to essential functionality. Maybe this is a better conversation to have in person.

jimmymathews commented 1 year ago

Migrate to SPT? With the recent modularization work, I think SPT is prepared to have reasonably isolated submodules. To be extra careful about it, you can make a new module gnn at first rather than a new "workflow". That is, a subdirectory sibling to workflow, db, countsserver, etc. Then register Python dependencies in pyproject.toml's [project.optional-dependencies], as in pip install spatialprofilingtoolbox[gnn], and other kinds of dependencies can be installed as directed in a Dockerfile for an spt-gnn Docker image. By the way pyproject.toml is currently asking for requires-python = ">=3.9", so the Python version shouldn't be a problem. To start out you can follow the pattern of the other modules in defining scripts/ with specific functionality meant to be run at the CLI in the form spt gnn train --database-config-file ... or similar. (One request here is to try never to use directories in dealing with file-system-materialized inputs or outputs; just flat list of files.) By doing so these functions can be available to a new "workflow" when we are ready to write it. Based on this commentary, is there any reason not to move everything into the SPT repository?

What to do with results. The imminent use is adding to our store of "computed features" that will eventually be readily available (no matter what type) for inspection against covariates in the SPT dashboard. Thus this involves extracting intermediate numeric features from evaluation of the model on samples, which should implicitly have something to do with the prescribed outcome variable. I'm not sure exactly what these models provide, so the exact extraction will be up to you.

Visualizations. This first export described above is sufficient for this issue, I think. An additional effort will be needed to create visualizations, though you are welcome to attempt this as well if think it will not be too difficult. For this it's best to have the raw images readily at hand, and I'm not sure that's quite true for our exemplar datasets. (We have them, but not "readily at hand").

jimmymathews commented 1 year ago

So I would say, as for the scope clarifications:

Yes, training as you've described.
Qualitative explainability maybe postponed.
Incorporating prior research metrics if easy, but not required.
Save the model for future reference, but currently no imminent plans for re-use.

jimmymathews commented 1 year ago

The general pattern can be to checkout an issue branch, e.g. issue85, and work a lot in there until ready for a PR.

CarlinLiao commented 1 year ago

Let's consider these "computed features". A fully complete GNN workflow will create

A single trained model for an entire study (complete in independent repo)
Importance scores for graph classification for each cell/histological structure (complete)
Graph visualizations of groups of cells/HSs as images or HTML interactives (in progress)
"Class separability" between features or groups of features. Higher values denote that a feature is more important for classification. Complexity increases with the number of classes as it's calculated pairwise across all output classes, as well as for the user as it's up to them to define "groups of features" as those aren't defined in the SPT schema. (not yet in progress)

How many of these line up with your vision of what a "computed feature" is?

jimmymathews commented 1 year ago

Ah none of those are what I meant. I meant feature in the sense of statistics (page xi), i.e. a numeric value for each element of the sample set. A GNN is in particular a neural network, a neural network has nodes, and nodes get activation values upon sample presentation. I was thinking from the point of view of, for example, autoencoders, where a handful of N intermediate nodes provides an N-dimensional representation of the sample space upon application of the encoder. N useful features. Is that not what we're doing? I'm assuming we're getting some final ROC curve at the end of all this modeling, based on some best feature for discriminating the binary outcome.

It's not a great example of something defineable within a true ontology, but "Feature specification" and "Feature specifier" and "Quantitative feature value" are the tables in the schema that will receive such feature values.

The second thing you mentioned, importance scores at the cell level, could be used to derive a couple of useful features. E.g. the fraction of "important cells" (according to a threshold) that are positive for each of the markers, or belong to each of the pre-defined phenotypes.

CarlinLiao commented 1 year ago

What do we consider to be the sample set here? Cells/histological structures, specimens, or something else?

The GNN as set up isn't intended to have its intermediate values be pulled out for consumption, although pulling out importance scores is doing so indirectly. At present, the model's set up to output inferred classification only, but a slight tweak will allow us to get the probabilistic values we can use to construct an ROC curve. That will require a withheld test set, which as we discussed with Saad we'll struggle with at such small sample sizes. Even if we were to do so, an ROC curve isn't as much a summary statistic as an entire plot evaluating the performance of the GNN.

The funny thing is that the importance threshold is ill-defined. In the source paper, the authors kind of throw up their hands and just test a couple different hard cutoffs for important cell counts, ranging from 5 to 50, to use in their quantitative analysis.

Per the schema, the features as defined in the "Feature specification" and associated tables are only attached to their source data by the study.

How would we identify the study the GNN derived features come from since they're reliant on 3 different studies/sub-studies? (This probably goes back to our earlier discussion about batching studies.)
These features can be any shape we want since they're so loosely attached to the source data, but the quantitative feature value might get messy and require a lot of grouping to be usable.

Functionally I'd say that having the GNN output the quantitative features of importance scores per cell and separability per ~feature~ target/phenotype would be the most natural fit into the schema laid out, although we should still concern ourselves with visual graph generation since that'll be most intuitive for us to analyze and nice for the paper submission.

And this isn't yet going into creating GNN variants that include only biological markers or only phenotypes in addition to both...

jimmymathews commented 1 year ago

Very good comments. The schema specification is deliberately a bit ambiguous about what counts as the sample set for the purpose of "Feature specifications". But we can get some guidance from the purpose of such features, which I intended to be:

a summary of what a given computational workflow is able to tell a person invested in the dataset about that dataset,
in a format which is readily compared across computational workflows,
that is subjectable to manual inspection/verification/interpretation against covariates (outcome) by an investigator familiar with the biology.

Kind of a "lowest common denominator".

The 3 a priori possibilities apparent to me for the sample set are: cells/histological structures, specimens/slides, and subjects/patients.

Cells more or less satisfy intentions 1, 2, and 3, except the "manual inspection" part, which is not quite tractable with so many millions of cells. If cells are the sample set a recommended aggregation procedure to specimen level would probably be required. (It seems to me that in the case we are considering, "importance" quantification or something, for a cell as a node of a graph fed to the GNN, does not suggest an obvious meaningful aggregation procedure to specimen level.)

Subjects/patients should probably be our default meaning for sample set. In cases where these are simple mean aggregations over specimen sets, a specimen-level feature could perhaps be provided by default as well. In the definition of "Quantitative feature value", the "Subject" field is defined as The entity that was subjected to the feature derivation or quantification process. So an identifier that adequately identifies a specimen or a subject is fine here, I think.

I agree with you that the most natural fit with the schema is something like quantitative features of importance scores per cell and separability per feature target/phenotype. Perhaps importance-score-weighted expression levels for each phenotype/channel/target? This would seem to remove the need for a threshold. I'm a little unclear on how the node features are usable or not usable in GNN modeling. Is the "separability" you're referring to related to some kind of parallel training, doing training separately for each phenotype/channel?

I agree that we should still concern ourselves with visual graph generation, but I'm not sure how much effort should go into making this an automatic figure generation for a UI element in the SPT dashboard right away; this seems to be an additional independent effort.

To answer your question 1:

I think the GNN analysis is a data analysis study. The correct thing to do here is probably to register a new data analysis study, and also register it as a component of the larger "project" (this element to appear soon as an upgrade to the current schema).

jimmymathews commented 1 year ago

* There is an independent practical reason to avoid cells as the sample set: Storage space in the database. Currently the huge bulk of storage is taken up by the cell-level quantifications, just because of the sheer number of entries. Adding another such bulky unit for every single feature output by a given workflow is probably too greedy.

CarlinLiao commented 1 year ago

Cell importance. The point about cell importance being per graph is a key one. There are multiple ROIs generated per specimen, they sometimes overlap, and not ever cell is included in an ROI if it's not near a tumorous cell. We would struggle to generate an applicable feature out of these. These are probably best left as a computational tool and not a recorded metric.
Separability scores, provided I can get them working, would be per GNN input feature (i.e., channel/target/phenotype) and pair of classes the GNN is inferring. They're meant to compare different feature sets ("concepts") and explainer mechanisms. It's not parallel, but one shot, taking in all the ROIs at once (usually the test set) and then calculating separability based off the entire dataset.
Visual graph generation is ready to go actually. The pipeline generates an html file (presumably I can get the raw JS too) that can be directly embedded into a web application if you so choose. There's one of these per graph/ROI, not specimen, so their utility might not be scalable.

CarlinLiao commented 1 year ago

Addendum on separability: I've hit a roadblock trying to replicate the histocarto team's separability calculations on their own workflow, never mind adapting theirs to ours. I've opened up a few tickets to try and get that resolved, but for now I think we should move forward with visualizations only.

jimmymathews commented 1 year ago

Ok, let us do that then -- visualizations only. Can you write here the details of what you think is feasible to include in this visualization?

I guess will have to revise my aim to integrate the GNN workflow in the SPT dashboard in a seamless way. If it is ever integrated, it sounds like it will have very special needs with respect to UI presentation.

Based on this I'm struggling to see how any output of the GNN computation will be even slightly useful to the investigators who collect and study the dataset. Can you try to craft such a "use" narrative? I tried to outline one above (those 3 intentions), but my narrative seems unachievable now.

CarlinLiao commented 1 year ago

I can send you an example of the visualization over email. I have it set to visualize the entire cell network of an ROI, with cells sized and colored according to their importance. On hovering over a cell, the tooltip shows the histological structure ID of the cell, as well as which channels/targets/phenotypes are activated. If there's more information available I could pack it into the visualization too, but I can't think of any. Specimen and outcome are the same across an entire ROI, and it would be tough to combine ROIs because the importance scores wouldn't be compatible, as mentioned earlier.

The actual model being created could be applied to future pathology slides in the same vein that don't have results already, and we ultimately want to pull out those quantitative metrics once I can get that figured out.

jimmymathews commented 1 year ago

Ah ok, I was under the impression you were just talking about the graph (nodes with locations and edges). So it is feasible that a pathologist might be interested in trying to discern a pattern in the apparent importance score distribution that occurs in the given cases (i.e. outcome). Based on a visual inspection.

CarlinLiao commented 1 year ago

After discussion, the necessary changes or additions to the GNN workflow before inclusion into SPT are as follows:

Refactor code to strip out remaining, incomplete support for tissue graphs
- Try to remove the need for model config files as much as possible
Reduce model training to use only chemical species inputs (alternatively, make this optional)
Integrate graph visualization into stack
- Add dropdown to change cell coloring based on input feature value
- Stitch together ROIs from the same specimen into a single graph for visualization, and apply label
- Update model input feature references to be named instead of indexed (e.g., columns named "oxide" instead of "1")
Implement quantitative model performance scores, i.e., similarity scores, on phenotypes
- Each phenotype will be a "concept" with only one "attribute", itself
- Could also explore treating phenotype concepts as groups of chemical species, aligned or not, but probably not

jimmymathews commented 1 year ago

Note: the "concept" and "attribute" terminology refers to usage in the histocartography paper.

This is a great breakdown, thank you!

jimmymathews commented 1 year ago

After further discussion of the internals of histocartography, it seems that my earlier comment:

The second thing you mentioned, importance scores at the cell level, could be used to derive a couple of useful features. E.g. the fraction of "important cells" (according to a threshold) that are positive for each of the markers, or belong to each of the pre-defined phenotypes.

almost exactly describes what they are doing in histocartography.

They consider, for a series of values of importance score thresholds (a rank/integer cutoff k), the distribution of the values of a given attribute (e.g. phenotype assignment function with values 0 or 1). They restrict to the samples belonging to given outcome classes, then compare such distributions by computing Wasserstein distance d.

They are also doing a little extra: regarding the resulting distance values for various k as describing an ROC curve (not quite right, kind of an analogy) in order to compute an AUC. They call this a separability score.

This is fine, but the extra step is not necessary for us and introduces an additional barrier to interpretation of the resulting summary metric.

I proposed that we omit the AUC steps and just report, for each phenotype and binary outcome, the pair (k, d) for which d is largest. Of course we also do not have to use Wasserstein distance, d could instead be difference in means (typical "effect size"), or else a p-value for a t-test or similar.

Note that there are indeed "specimen-level features" available, as I had hoped. Namely, for each phenotype, the mean of the distribution described above when restricted to a given specimen/slide. In the terminology of the schema, these features could be described by 2 enumerated specifiers:

phenotype of restriction (e.g. "T cell", "B cell", ...)
importance score rank threshold (that is, k, e.g. 1, 2, ...)

There is also already room in the schema to record those p-values, namely in Two-cohort feature association test. Selection criterion 1 and selection criterion 2 would be the two labels of a binary outcome, and the test could be "t-test", the p-value is the p-value, and the feature tested the given feature specification.

jimmymathews commented 1 year ago

The above pre-calculations could be superseded entirely with an interactive variant, in which a slider controls the k, and the visualization masks cells with importance rank beyond the chosen k, and the distribution means and t-tests etc. are performed on-the-fly. This would give the UI operator more control over the exploration and enable discovery.

Let's keep this interactive version in mind for the future!

CarlinLiao commented 1 year ago

A quick thought: we discussed releasing the binary 0-1 chemical species value conversion to make the separability calculations more meaningful, but after more thought I've realized that even if the chemical species have continuous values between 0 and 1, we calculate separability scores per concept and those don't have an equivalent float representation.

CarlinLiao commented 1 year ago

All of the changes detailed in this post have been completed and pushed to the GNN workflow repo. The only step left before we can integrate it into SPT is validating that it works on a dataset that isn't the Dr. Hollman's melanoma data.

CarlinLiao commented 1 year ago

Additional features requested

Unified importance scores across all ROIs from a single specimen, weighting by model inferred label confidence
Separability scores for chemical species in addition to advanced phenotypes

CarlinLiao commented 1 year ago

Separability scores for chemical species added, unification of importance in process—implementing this called to focus a bug and an oddity with inference that required additional fixes. Two more changes were requested to bring the total outstanding to 3.

Unified importance scores across all ROIs from a single specimen, weighting by model inferred label confidence
Add condition that hist structures being pulled are only cells
Build ROIs around user supplied phenotype class name (optional for the user)

nadeemlab / SPT

GNNs #85

Acceptance