Closed CarlinLiao closed 11 months ago
I added the random seed setting to cg-gnn, but after some testing the importance scores still aren't reproducible. More debugging with PyTorch revealed that at least one portion of the multi_layer_gnn uses a non-deterministic algorithm on GPU, which could explain the lack of reproducibility.
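For reference, a minimal sketch of the kind of seeding discussed here, assuming a standard PyTorch setup. The helper name seed_everything is hypothetical and is not claimed to be what cg-gnn does; the torch calls are the ones described in the PyTorch reproducibility notes. NumPy and PyTorch are imported lazily so the helper also runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the available RNGs and request deterministic kernels (hypothetical helper)."""
    # cuBLAS needs this set before CUDA work starts to make some matmuls deterministic.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)  # seeds the CPU generator and all CUDA devices
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raises a RuntimeError when an op has no deterministic implementation,
    # which helps locate the offending layer.
    torch.use_deterministic_algorithms(True)
```

With torch.use_deterministic_algorithms(True) enabled, an operation that only has a non-deterministic GPU implementation fails loudly instead of silently producing run-to-run differences, which is one way to pin down which part of the network is responsible.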
Something practical that we could target testing instead is whether the same cells are reliably identified as the most important over multiple, fresh runs. That's only possible if the model gets reliable lift from the training data, which might not be possible on the tiny test sets.
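One way such a test could be phrased, as a rough sketch: compare the top-k most important cells between every pair of runs and average the Jaccard overlap. The function names, the dict-of-scores input format, and the choice of Jaccard overlap are all assumptions for illustration, not part of cg-gnn or SPT.

```python
from itertools import combinations


def top_k_cells(importance: dict[str, float], k: int) -> set[str]:
    """Return the IDs of the k cells with the highest importance score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def mean_top_k_overlap(runs: list[dict[str, float]], k: int) -> float:
    """Average Jaccard overlap of the top-k cell sets over all pairs of runs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tops = [top_k_cells(run, k) for run in runs]
    if any(not top for top in tops):
        raise ValueError("every run needs at least one scored cell")
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Three hypothetical runs over the same three cells.
runs = [
    {"c1": 0.90, "c2": 0.80, "c3": 0.10},
    {"c1": 0.70, "c2": 0.20, "c3": 0.60},
    {"c1": 0.95, "c2": 0.85, "c3": 0.05},
]
overlap = mean_top_k_overlap(runs, k=2)
```

A test could then assert that the overlap stays above some threshold, which would have to be chosen empirically once the model gets reliable lift from the training data.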
That said, once while testing I was able to get a really well-performing model (50% accuracy to 75%) on the small melanoma CyTOF dataset, but that was before I started setting a reproducible seed. Either that was a fluke, or it is possible to get lift out of the smaller datasets so long as the model is provided the training examples in just the right way. I'm sure it'd be more consistent if we had more and larger specimens, though.
I will assume you are following https://pytorch.org/docs/stable/notes/randomness.html. This normally leads to a drastic drop in performance. As you suggested, finding consistent cells across multiple, fresh runs is the best way to proceed.
The strict reproducibility is mainly for the testing setup. In the real runs the model performance has more priority than strict reproduction.
Side note: The term is urothelial (not urolithelial).
As for the unlabeled specimens, I don't think it makes sense to try to repurpose the non-focal specimens belonging to one of the data collections as a default input for application of the trained model. The trained model could be used during a possible cross-validation, or as part of a brand new application feature of model application to user-provided single samples from a separate exploratory cohort (not part of the study on which the model was trained). These possible future uses require thoughtful planning and they are not the most urgent items.
I've made a significant upgrade to the cg-gnn pip package with four major changes:
graphs.bin, accompanied by a single graph_info.pkl metadata file and a feature_names.txt file that gives interpretable names to the graph data columns, for nicer interplay with Nextflow.
The last two changes will impact SPT directly. Supporting the random seed functionality should be as simple as adding another CLI input arg and changing the workflow test to exploit the reproducibility, assuming it works as intended, but the latter may be more involved.
The purpose of a machine learning model is to predict unlabeled data, but, as implemented, when using spt cggnn extract we're only fetching specimens with locked-down strata. For example, the strata of the urolithelial dataset are as follows:
Right now when I extract data I simply drop stratum 16, but if we could add some functionality so that stratum 16 is extracted as "unlabeled" data, that could be one way to naturally enable this functionality. The other study with unknown strata is melanoma CyTOF, which looks slightly different.
Perhaps we identify strata that have more empty values than other strata as "unlabeled"?
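A rough sketch of that heuristic, assuming the strata are available as a mapping from stratum ID to a dict of field values. The function name, the field names, and the example values below are hypothetical and do not reflect the actual SPT schema; "more empty values than other strata" is interpreted here as "more than the stratum with the fewest empty values".

```python
def find_unlabeled_strata(strata: dict, empty_markers: tuple = ("", None)) -> list:
    """Return IDs of strata with more empty fields than the best-filled stratum."""
    if not strata:
        return []
    # Count the empty fields in each stratum.
    empty_counts = {
        stratum_id: sum(value in empty_markers for value in fields.values())
        for stratum_id, fields in strata.items()
    }
    fewest = min(empty_counts.values())
    return sorted(sid for sid, count in empty_counts.items() if count > fewest)


# Hypothetical strata table: field names and values are made up for illustration.
example_strata = {
    1: {"source_site": "tumor", "response": "responder"},
    2: {"source_site": "tumor", "response": "non-responder"},
    16: {"source_site": "tumor", "response": ""},
}
unlabeled = find_unlabeled_strata(example_strata)
```

This would flag stratum 16 in the made-up table above. It would need checking against the melanoma CyTOF strata, which look slightly different, before relying on it.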
I'd consider this issue closed when SPT is updated to use cg-gnn version 0.2.1, which will include all these changes.