jsxlei / SCALEX

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space
BSD 3-Clause "New" or "Revised" License

Intra-sample batches for large datasets #6

Closed parashardhapola closed 1 year ago

parashardhapola commented 1 year ago

Hi @jsxlei,

Fantastic work, and congratulations on the publication! 🎊

I'm trying to fully leverage the memory efficiency of SCALEX, and I wonder whether SCALEX can be used to train on a single large dataset (say, 1 million cells) in batches (say, 10,000 cells at a time). In that case there would be 100 online training events. Have you tried something like this in-house?
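The scheme proposed above, splitting 1 million cells into 100 updates of 10,000 cells each, is essentially iteration over random index slices. A minimal numpy sketch (the batch generator is illustrative, not part of the SCALEX API; each yielded index array would feed one model update):

```python
import numpy as np

def minibatches(n_cells, batch_size, rng=None):
    """Yield arrays of cell indices covering all n_cells once, in random order."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(n_cells)
    for start in range(0, n_cells, batch_size):
        yield order[start:start + batch_size]

# Example: 1,000,000 cells in batches of 10,000 -> 100 "online" training events.
n_updates = sum(1 for _ in minibatches(1_000_000, 10_000))
print(n_updates)  # 100
```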

We have built Scarf, a very memory-efficient pipeline, and SCALEX would be a perfect addition.

/PD

jsxlei commented 1 year ago

Hi Parashar,

Thank you! Actually, SCALEX does already train the model by splitting the data into batches; the default batch_size is 64, which makes SCALEX very memory-efficient during training. The most memory-consuming part is the preprocessing, which requires loading the whole dataset; there I chunk the data into pieces of 20,000 cells each. With this approach, SCALEX can handle up to 4 million cells with roughly 100 GB of CPU memory. I am also considering taking advantage of AnnData's backed mode in the future, which would greatly reduce memory usage.
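The chunked preprocessing described here, processing 20,000 cells at a time so the full matrix never has to be densified at once, can be sketched with plain numpy. The `normalize_in_chunks` function below is an illustrative stand-in for the actual SCALEX preprocessing, using simple library-size normalization; in a backed-mode setup, each slice would instead come from an AnnData opened with `anndata.read_h5ad(path, backed="r")`:

```python
import numpy as np

def normalize_in_chunks(X, chunk_size=20_000):
    """Library-size normalize each cell (row), loading chunk_size rows at a time."""
    out = np.empty_like(X, dtype=float)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size].astype(float)
        totals = chunk.sum(axis=1, keepdims=True)
        # Guard against all-zero cells to avoid division by zero.
        out[start:start + chunk_size] = chunk / np.maximum(totals, 1)
    return out

# Small demo: the chunked result matches the all-at-once computation.
X = np.random.default_rng(1).integers(0, 5, size=(50, 10))
full = X / np.maximum(X.sum(axis=1, keepdims=True), 1)
assert np.allclose(normalize_in_chunks(X, chunk_size=7), full)
```

The peak memory is then governed by `chunk_size` times the number of genes, rather than by the total cell count.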

I am very interested; please let me know how we could collaborate on integration with Scarf.

Thank you! Lei