cggnn workflow errors if no graphs are created from a specimen

nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images

https://oncopathtk.org


cggnn workflow errors if no graphs are created from a specimen #260

Closed CarlinLiao closed 11 months ago

CarlinLiao commented 11 months ago

When running on a study with very large and very small specimens (e.g., urothelial), some specimens will be deemed too small to create a graph for. When that happens, spt cggnn create-specimen-graphs errors because DGL tries to save an empty list of graphs.

A simple but insufficient fix is to change create-specimen-graphs so it doesn't try to save graphs when none are created from a specimen. However, this results in a Nextflow error, because Nextflow expects the process to produce a graph data artifact.
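The guard described above could look roughly like the following. This is a minimal sketch, not the actual SPT code: `save_graphs_if_any` and its `save_fn` parameter are hypothetical names, and the saver is passed in so the example runs without DGL installed (in practice it would presumably be something like `dgl.save_graphs`). It only avoids the empty-list save; it does nothing about Nextflow expecting an output file.

```python
from typing import Any, Callable, Sequence


def save_graphs_if_any(
    graphs: Sequence[Any],
    output_path: str,
    save_fn: Callable[[str, list], None],
) -> bool:
    """Save graphs only when at least one exists.

    Hypothetical helper: `save_fn` stands in for the real graph saver.
    Returns True if something was written, False if the save was skipped.
    """
    if len(graphs) == 0:
        # No graphs were created from this specimen, so skip the save
        # instead of handing the saver an empty list.
        return False
    save_fn(output_path, list(graphs))
    return True
```

The boolean return value lets the caller log or report that a specimen produced no graphs rather than failing silently.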

Possible solutions:

Move or duplicate the logic that decides if graphs are created from create-specimen-graphs upstream to prepare-graph-creation. Not my preferred strategy since deciding how many graphs are created happens concurrently with graph creation, but it's possible.
Edit the Nextflow process so it knows that create-specimen-graphs may not create any output.

jimmymathews commented 11 months ago

I think I misunderstand something here. No matter how small, each specimen should not be "deemed too small" to create a graph. Surely it is easy enough to create 1 graph in that case. (Do we actually have such a case?)

CarlinLiao commented 11 months ago

This is intentional. Graphs are not created when specimens are too small, because such specimens don't provide enough information to train a good model. This is the same reason why we don't run cggnn on datasets with smaller slides, like CyTOF or breast cancer IMC.

jimmymathews commented 11 months ago

None of the slides in the urothelial dataset are too small. The smallest that occurs is 2724 cells.

jimmymathews commented 11 months ago

No, we should still be able to run the pipeline in the presence of the "smaller slides" from those datasets. The performance is a separate matter; it does not justify restricting our implementation to the best-case scenarios only.

CarlinLiao commented 11 months ago

I set the cells_per_slide_target we've been using to 5000 (the name is a misnomer and should be changed to cells_per_ROI_target or something). If I recall correctly, I arrived at this value after observing which datasets cg-gnn worked well for and which it didn't. That was after I switched from a predetermined ROI size to one determined dynamically from the average cell density across specimens plus a target number of cells per ROI.
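To make the dynamic sizing concrete, here is a rough sketch of how an ROI size could be derived from average cell density and a per-ROI cell target. It is not the actual cg-gnn implementation: the function name, the square ROI shape, and the choice to average per-specimen densities (rather than pool all cells) are assumptions made for illustration.

```python
from math import sqrt
from typing import Sequence


def roi_side_length(
    cell_counts: Sequence[int],
    specimen_areas: Sequence[float],
    cells_per_roi_target: int = 5000,
) -> float:
    """Side length of a square ROI expected to hold about the target
    number of cells at the mean density across specimens.

    Hypothetical helper; areas are in squared length units (e.g. pixels^2),
    and the result is in the matching length unit.
    """
    if len(cell_counts) == 0 or len(cell_counts) != len(specimen_areas):
        raise ValueError("need one area per specimen and at least one specimen")
    if any(area <= 0 for area in specimen_areas):
        raise ValueError("specimen areas must be positive")
    # Per-specimen density in cells per unit area, then the plain mean.
    densities = [n / a for n, a in zip(cell_counts, specimen_areas)]
    mean_density = sum(densities) / len(densities)
    if mean_density <= 0:
        raise ValueError("mean cell density must be positive")
    # target = density * side^2  =>  side = sqrt(target / density)
    return sqrt(cells_per_roi_target / mean_density)
```

For example, with two specimens that both have a density of 0.005 cells per square pixel, a target of 5000 cells gives an ROI side of about 1000 pixels.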

But yes, these are probably two separate issues:

the cggnn workflow errors if no ROIs are created from a specimen, which is intended functionality as-is (this issue)
should there be such a thing as a slide that's too small, and if so are the smallest slides in urothelial that small?

Let's open up a separate issue for the latter, or an email/in-person talking thread if we determine it's unrelated to the SPT codebase.