nadeemlab / cg-gnn

Create cell graphs from pathology slide data and train a graph neural network to predict patient outcomes for SPT.
GNU Affero General Public License v3.0

Memory demands are too high #5

Closed CarlinLiao closed 10 months ago

CarlinLiao commented 1 year ago

The HPC we usually run this on masked the issue, but the memory demands of this package are very high. For example, for lesion 0_1 in the Melanoma intralesional IL2 dataset, I get this error:

ArrayMemoryError
Unable to allocate 25.8 GiB for an array with shape (58896, 58896) and data type float64

at this line https://github.com/CarlinLiao/cg-gnn/blob/c46fbbce52a376a5597b32bd2410da30e68d35f4/cggnn/generate_graph_from_spt.py#L47-L48
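The reported size follows directly from the matrix shape: a dense pairwise distance matrix stores one float64 (8 bytes) for every pair of cells, so memory grows with the square of the cell count. A quick check, where `dense_matrix_gib` is a hypothetical helper written for this illustration and not part of cg-gnn:

```python
import numpy as np


def dense_matrix_gib(n_cells: int, dtype=np.float64) -> float:
    """Size in GiB of a dense n_cells x n_cells pairwise distance matrix."""
    bytes_per_entry = np.dtype(dtype).itemsize
    return n_cells * n_cells * bytes_per_entry / 2**30


# Cell count taken from the error message above.
print(f"{dense_matrix_gib(58896):.1f} GiB")  # 25.8 GiB
```

Switching to float32 would only halve this, so the quadratic growth is the real problem rather than the data type.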

On a weaker local machine, it also seems to crash earlier, during the first large SQL query that fetches all cells: https://github.com/CarlinLiao/cg-gnn/blob/c46fbbce52a376a5597b32bd2410da30e68d35f4/cggnn/spt_to_df.py#L20-L41

CarlinLiao commented 1 year ago

Memory demands are so high because

  1. There are a lot of cells in a single slide, particularly for the "Melanoma IL2" dataset.
  2. The algorithm calculates the distance between every cell and every other cell, and then does percentile calculations to find where the slide is densest (as implemented, which cell is the closest to the most other cells), so it can place the next ROI.

https://github.com/CarlinLiao/cg-gnn/blob/c46fbbce52a376a5597b32bd2410da30e68d35f4/cggnn/generate_graph_from_spt.py#L56-L58

As I understand it, this approach means that the distance calculation can't be done on the fly with something like a KDTree, as is done in SPT, because of that ranking mechanism. Maybe you have thoughts @jimmymathews?

CarlinLiao commented 10 months ago

Resolved by replacing the square distance-matrix calculation with a KDTree.
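For readers wondering how a KDTree can stand in for the full distance matrix, here is a minimal sketch. It is not the code that was merged into cg-gnn: it assumes density can be approximated by counting neighbors within a fixed radius, which replaces the percentile ranking described earlier, and `densest_cell` and `radius` are hypothetical names chosen for this example. It relies on the `return_length` option of `scipy.spatial.KDTree.query_ball_point`, available in SciPy 1.6 and later.

```python
import numpy as np
from scipy.spatial import KDTree


def densest_cell(coords: np.ndarray, radius: float) -> int:
    """Return the index of the cell with the most neighbors within `radius`.

    coords is an (n_cells, 2) array of cell centroid positions.
    Assumption: neighbor count within a fixed radius is used as the
    density measure, in place of a percentile over all pairwise distances.
    """
    tree = KDTree(coords)
    # return_length=True yields only the neighbor counts, so no
    # n_cells x n_cells array and no per-cell neighbor lists are stored.
    counts = tree.query_ball_point(coords, r=radius, return_length=True)
    return int(np.argmax(counts))


# Synthetic example: sparse background plus one tight cluster.
rng = np.random.default_rng(0)
background = rng.uniform(0, 1000, size=(500, 2))
cluster = rng.normal(loc=(2000, 2000), scale=5, size=(100, 2))
coords = np.vstack([background, cluster])
print(densest_cell(coords, radius=20.0))
```

Every cell counts itself as a neighbor, which shifts all counts by one and does not change which cell is ranked densest.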