Supporting `cg-gnn` version 0.2

CarlinLiao commented 11 months ago

I've made a significant upgrade to the cg-gnn pip package with four major changes:

Save graphs when asked as a single graphs.bin, accompanied by a single graph_info.pkl metadata file and a feature_names.txt file that gives interpretable names to the graph data columns, for nicer interplay with Nextflow.
Make the way graphs are passed between functions internally consistent.
Support fixing a random seed for reproducibility.
Allow graph creation from unlabeled specimens.

The last two changes will impact SPT directly. Supporting the random seed should be as simple as adding another CLI input arg and changing the workflow test to exploit the reproducibility, assuming it works as intended. Supporting unlabeled specimens may be more involved.

The purpose of a machine learning model is to predict unlabeled data, but, as implemented, when using spt cggnn extract we're only fetching specimens with locked-down strata. For example, the strata of the urolithelial dataset are as follows:

   stratum identifier local temporal position indicator                      subject diagnosed condition subject diagnosed result
0                  14               Before intervention  Response to immune checkpoint inhibitor therapy                Responder
1                  15               Before intervention  Response to immune checkpoint inhibitor therapy            Non-responder
15                 16               Before intervention  Response to immune checkpoint inhibitor therapy

Right now when I extract data I simply drop stratum 16, but if we added an option to extract stratum 16 as "unlabeled" data, that would be a natural way to support unlabeled specimens. The other study with unknown strata is melanoma CyTOF, which looks slightly different.

   stratum identifier local temporal position indicator subject diagnosed condition subject diagnosed result
0                   9               Before intervention         Response to therapy                Responder
1                  10               Before intervention         Response to therapy            Non-responder
30                 11

Perhaps we identify strata that have more empty values than other strata as "unlabeled"?
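
To make that heuristic concrete, here is a minimal sketch in plain Python. It assumes the strata are available as a list of dicts keyed by the column names shown in the tables above. The function name `find_unlabeled_strata` and the `LABEL_COLUMNS` constant are hypothetical and not part of SPT or cg-gnn.

```python
# Hypothetical helper, not part of SPT or cg-gnn: flag strata that have more
# empty label fields than the best-labeled stratum in the same study.
LABEL_COLUMNS = (
    "local temporal position indicator",
    "subject diagnosed condition",
    "subject diagnosed result",
)


def find_unlabeled_strata(strata, label_columns=LABEL_COLUMNS):
    """Return the identifiers of strata with more empty fields than the minimum."""
    def empty_count(row):
        # Treat missing keys, None, and whitespace-only strings as empty.
        return sum(1 for col in label_columns if not (row.get(col) or "").strip())

    counts = {row["stratum identifier"]: empty_count(row) for row in strata}
    if not counts:
        return []
    fewest = min(counts.values())
    return [stratum for stratum, n in counts.items() if n > fewest]


# The two strata tables quoted above, transcribed as dicts.
urothelial = [
    {"stratum identifier": 14,
     "local temporal position indicator": "Before intervention",
     "subject diagnosed condition": "Response to immune checkpoint inhibitor therapy",
     "subject diagnosed result": "Responder"},
    {"stratum identifier": 15,
     "local temporal position indicator": "Before intervention",
     "subject diagnosed condition": "Response to immune checkpoint inhibitor therapy",
     "subject diagnosed result": "Non-responder"},
    {"stratum identifier": 16,
     "local temporal position indicator": "Before intervention",
     "subject diagnosed condition": "Response to immune checkpoint inhibitor therapy",
     "subject diagnosed result": ""},
]
melanoma = [
    {"stratum identifier": 9,
     "local temporal position indicator": "Before intervention",
     "subject diagnosed condition": "Response to therapy",
     "subject diagnosed result": "Responder"},
    {"stratum identifier": 10,
     "local temporal position indicator": "Before intervention",
     "subject diagnosed condition": "Response to therapy",
     "subject diagnosed result": "Non-responder"},
    {"stratum identifier": 11,
     "local temporal position indicator": "",
     "subject diagnosed condition": "",
     "subject diagnosed result": ""},
]

print(find_unlabeled_strata(urothelial))
print(find_unlabeled_strata(melanoma))
```

One caveat with this approach: a study where every stratum is equally complete yields no unlabeled strata, and a study where only one stratum is fully labeled would mark all the others as unlabeled, so it may need a per-study override.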

I'd consider this issue closed when SPT is updated to use cg-gnn version 0.2.1, which will include all these changes.

CarlinLiao commented 11 months ago

I added the random seed setting to cg-gnn, but after some testing the importance scores still aren't reproducible. More debugging with PyTorch revealed that at least one portion of the multi_layer_gnn uses a non-deterministic algorithm on GPU, which could explain the lack of reproducibility.
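
For reference, a minimal sketch of what seeding plus PyTorch's determinism switch can look like. This is not the code in cg-gnn; `seed_everything` is a hypothetical helper name, and NumPy and PyTorch are treated as optional so the snippet runs without them. With `torch.use_deterministic_algorithms(True)`, operations that have no deterministic implementation raise a `RuntimeError`, which is one way to locate the offending layer.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the common RNGs and request deterministic kernels."""
    random.seed(seed)

    # Some cuBLAS routines are only deterministic with this setting; it has to
    # be in the environment before CUDA is initialized.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; only the seeds above are set.

    torch.manual_seed(seed)  # Seeds the CPU and CUDA generators.
    torch.backends.cudnn.benchmark = False  # Avoid run-dependent kernel selection.
    # Ops without a deterministic implementation now raise RuntimeError
    # instead of silently producing run-to-run differences.
    torch.use_deterministic_algorithms(True)


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```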

A more practical test target would be whether the same cells are reliably identified as the most important across multiple fresh runs. That only works if the model gets reliable lift from the training data, which might not be possible on the tiny test sets.
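
One way such a check could be scored is the mean pairwise Jaccard overlap of the top-k most important cells from each run. The sketch below assumes each run produces a mapping from cell identifier to importance score; the function names and the toy scores are made up for illustration and do not come from cg-gnn.

```python
from itertools import combinations


def top_k_cells(importance, k):
    """Return the set of the k cell identifiers with the highest importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def mean_top_k_overlap(runs, k):
    """Mean pairwise Jaccard overlap of top-k cell sets across runs (1.0 = identical)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tops = [top_k_cells(run, k) for run in runs]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Toy scores for three cells over two runs (illustrative values only).
stable_runs = [
    {"cell_1": 0.90, "cell_2": 0.80, "cell_3": 0.10},
    {"cell_1": 0.70, "cell_2": 0.95, "cell_3": 0.20},
]
unstable_runs = [
    {"cell_1": 0.90, "cell_2": 0.80, "cell_3": 0.10},
    {"cell_1": 0.10, "cell_2": 0.20, "cell_3": 0.90},
]

print(mean_top_k_overlap(stable_runs, k=2))
print(mean_top_k_overlap(unstable_runs, k=2))
```

A workflow test could then assert that the overlap stays above some threshold, with the threshold and k chosen empirically for the test dataset.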

That said, once while testing I was able to get a really well-performing model (50% accuracy to 75%) on the small melanoma CyTOF dataset, but that was before I started setting a reproducible seed. Either that was a fluke or it is possible to get lift out of the smaller datasets, so long as the model is given the training examples in just the right way. I'm sure it'd be more consistent if we had more and larger specimens, though.

sanadeem commented 11 months ago

I will assume you are using https://pytorch.org/docs/stable/notes/randomness.html. This normally leads to a drastic drop in performance. As you suggested, finding consistent cells across multiple, fresh runs is the best way to proceed.

jimmymathews commented 11 months ago

Strict reproducibility is mainly for the testing setup. In real runs, model performance takes priority over strict reproducibility.

Side note: The term is urothelial (not urolithelial).

jimmymathews commented 11 months ago

As for the unlabeled specimens, I don't think it makes sense to try to repurpose the non-focal specimens belonging to one of the data collections as a default input for application of the trained model. The trained model could be used during a possible cross-validation, or as part of a brand new application feature of model application to user-provided single samples from a separate exploratory cohort (not part of the study on which the model was trained). These possible future uses require thoughtful planning and they are not the most urgent items.

nadeemlab / SPT

Supporting `cg-gnn` version 0.2 #232