Memory usage needs to be optimized

liubovpashkova commented 4 years ago

I recently ran CELLEX on a large dataset of ~1.3 million cells and, to my surprise, encountered the following error:

MemoryError: Unable to allocate array with shape (1331984, 26182) and data type float64

So the server does not have enough memory to complete the task when the expression matrix is stored as float64 (the default). CELLEX consumes > 50% of RAM (more than 1 TB), and then the analysis inevitably stops.
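
For scale, the size of the failed allocation can be estimated directly from the shape in the error message. This is only a rough sketch; the helper name array_nbytes is made up for illustration and is not part of CELLEX:

```python
import math
import numpy as np

def array_nbytes(shape, dtype):
    """Bytes needed for one dense array of the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

shape = (1331984, 26182)  # shape taken from the MemoryError above
for dtype in (np.float64, np.float32):
    gib = array_nbytes(shape, dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
```

A single dense float64 copy of the matrix comes to about 260 GiB and a float32 copy to about 130 GiB, so a handful of temporary copies of that size could plausibly exhaust the server.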

To the developers: is it really necessary to store the expression matrix as float64? Is such high precision relevant here, or would float32 be sufficient?
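
As a rough reference for what float32 gives up, here is a small NumPy-only check. It says nothing about whether CELLEX's own computations are numerically sensitive, which would need testing:

```python
import numpy as np

# Machine epsilon: relative spacing between adjacent representable values near 1.
print("float64 eps:", np.finfo(np.float64).eps)
print("float32 eps:", np.finfo(np.float32).eps)

# float32 has a 24-bit significand, so integers (e.g. raw counts) are exact
# only up to 2**24; beyond that, neighbouring integers start to collide.
limit = 2**24
exact_below = np.float32(limit - 1) == limit - 1
collides_above = np.float32(limit) + np.float32(1) == np.float32(limit)
print("exact below 2**24:", exact_below)
print("2**24 + 1 collapses to 2**24:", collides_above)
```

So float32 carries roughly 7 significant decimal digits, against roughly 16 for float64.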

To users: I was able to work around the problem by converting my gene expression matrix (the variable data in the tutorial) from the default data type float64 to float32 before creating the ESObject, as follows:

import numpy as np
data_float32 = data.astype(np.float32)  # returns a float32 copy; data itself is unchanged

Don’t forget to delete the variables afterwards (we need to save Yggdrasil’s RAM):

del data
del data_float32
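
The workaround above can be tried end to end on a small synthetic matrix. This is a sketch under assumptions: the random DataFrame merely stands in for the tutorial's data variable, and the ESObject step is left out so that the snippet runs without CELLEX installed:

```python
import numpy as np
import pandas as pd

# Stand-in for the tutorial's `data`: a small float64 expression matrix.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.poisson(1.0, size=(1000, 50)).astype(np.float64))
bytes_before = data.to_numpy().nbytes

# astype returns a new float32 copy, so both versions are briefly in memory.
data_float32 = data.astype(np.float32)
bytes_after = data_float32.to_numpy().nbytes

# Drop the float64 original once the copy exists to release its memory.
del data

print(f"float64: {bytes_before} bytes, float32: {bytes_after} bytes")
```

Note that during the conversion the peak footprint is the float64 matrix plus its float32 copy, so the saving only takes effect once the original has been deleted.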

tstannius commented 4 years ago

I'm planning to convert input data to np.float32 by default, while allowing the user to specify an alternative format via a datatype argument. I just need to figure out a future-proof way of doing so. I am currently considering adding a kwargs argument to ESObject.

tstannius commented 4 years ago

Possibly solved by release v1.1.0. I will have to do some testing to find out if we need more aggressive changes.
