Closed parashardhapola closed 1 year ago
Hi Parashar,
Thank you! Actually, SCALEX does indeed train the model by splitting the data into batches; by default the batch_size is 64, which makes SCALEX very memory-efficient during training. The most memory-consuming part is the preprocessing, which requires loading the whole dataset; there I chunk the data into pieces of 20,000 cells per process. This way SCALEX can handle up to 4 million cells with roughly 100 GB of CPU memory. I am also considering taking advantage of AnnData's backed mode in the future, which would greatly reduce the memory footprint.
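The chunking scheme described above can be sketched roughly as follows. This is a minimal illustration, not SCALEX's actual implementation; `iter_chunks` is a hypothetical helper, and the chunk size of 20,000 is taken from the numbers in this thread:

```python
def iter_chunks(n_cells, chunk_size=20_000):
    """Yield (start, stop) index pairs covering n_cells in fixed-size chunks,
    so only one chunk needs to be resident in memory at a time."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)

# Hypothetical example: preprocess 4 million cells chunk by chunk.
chunks = list(iter_chunks(4_000_000, chunk_size=20_000))
print(len(chunks))        # 200 chunks of 20,000 cells each
print(chunks[0])          # (0, 20000)
print(chunks[-1])         # (3980000, 4000000)
```

Each `(start, stop)` pair would then index into the on-disk matrix (e.g. an AnnData object opened in backed mode) so the full expression matrix never has to be loaded at once.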
I am very interested; please let me know how we could collaborate with Scarf.
Thank you! Lei
Hi @jsxlei,
Fantastic work, and congratulations on the publication! 🎊
I'm trying to fully leverage the memory efficiency of SCALEX and wonder whether it can be trained on a single large dataset (say, 1 million cells) in batches (say, 10,000 cells at a time). In that case there would be 100 online training events. Have you tried something like this in-house?
We have built Scarf, a very memory-efficient pipeline, and SCALEX would be a perfect addition.
/PD