iancovert / persist

Is it REALLY necessary to blow up the sparse data to dense? #5

Closed - stela2502 closed this issue 1 month ago

stela2502 commented 7 months ago

I assume that the title might be enough.

In your example 01_persist_supervised.ipynb you have this statement:

# Initialize the dataset for PERSIST
# Note: here, adata_train.layers['bin'] is a sparse array, and
# adata_train.layers['bin'].A converts it to a dense array
from persist import ExpressionDataset

train_dataset = ExpressionDataset(adata_train.layers['bin'].A, adata_train.obs['cell_types_25_codes'])
val_dataset = ExpressionDataset(adata_val.layers['bin'].A, adata_val.obs['cell_types_25_codes'])

And of course, if you do not do that, it does not work.

Couldn't this tool support sparse data instead? This does not feel state of the art - sorry.

iancovert commented 7 months ago

Is there a reason you can't perform the conversion to a dense array? Would your data become too large to fit in memory? Unfortunately, because our algorithm is based on training a neural network, and libraries like PyTorch don't support sparse linear layers, I'm not sure there's a way around the sparse -> dense conversion step.

stela2502 commented 7 months ago

Hi Ian, initially I thought this was the problem, but I could fix it by simply adding more memory. In general I think it is extremely bad practice to blow up sparse matrices. Would it not be possible to convert line by line while you feed your model? I assume the model itself would not store the data - it should build a model from the data - so in total the memory requirement would be lower. Even if this kind of conversion takes more time, compared to the analysis itself it should be negligible.
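
To sketch what I mean (a rough illustration only, not PERSIST's ExpressionDataset; the class name SparseExpressionDataset and the random example data are made up), a PyTorch Dataset could keep the matrix in CSR format and densify a single row inside __getitem__:

import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset, DataLoader

class SparseExpressionDataset(Dataset):
    def __init__(self, X, labels):
        # X: scipy.sparse matrix (cells x genes), labels: 1D array of class codes
        self.X = sparse.csr_matrix(X)
        self.labels = np.asarray(labels)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # Densify only the requested row; the full matrix stays sparse.
        row = self.X[idx].toarray().ravel().astype(np.float32)
        return torch.from_numpy(row), int(self.labels[idx])

# Usage with random placeholder data:
# X = sparse.random(1000, 2000, density=0.05, format='csr')
# y = np.random.randint(0, 25, size=1000)
# loader = DataLoader(SparseExpressionDataset(X, y), batch_size=64, shuffle=True)

The per-row .toarray() calls add some overhead, but peak dense memory stays around one batch rather than the full matrix.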

rhngla commented 1 month ago

This is fixed by the linked pull request. Briefly, the conversion now only occurs at the time of loading a batch, without requiring the full dataset to be converted to dense when creating an instance of ExpressionDataset.
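
For anyone curious what batch-time conversion can look like in general (a minimal sketch under assumptions, not the code from the pull request; LazyBatchDataset and its collate method are hypothetical names), a custom collate_fn can slice the CSR matrix once per batch and densify only that slice:

import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset, DataLoader

class LazyBatchDataset(Dataset):
    def __init__(self, X, labels):
        self.X = sparse.csr_matrix(X)
        self.labels = np.asarray(labels)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # Return only the row index; densification is deferred to collate time.
        return idx

    def collate(self, indices):
        # Slice the sparse matrix once per batch and densify the slice.
        batch = self.X[indices].toarray().astype(np.float32)
        labels = self.labels[indices]
        return torch.from_numpy(batch), torch.as_tensor(labels)

# dataset = LazyBatchDataset(X, y)
# loader = DataLoader(dataset, batch_size=128, shuffle=True, collate_fn=dataset.collate)

Peak dense memory is then roughly batch_size x n_genes instead of n_cells x n_genes.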