jsxlei / SCALEX

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space
BSD 3-Clause "New" or "Revised" License

Intra-sample batches for large datasets #6

Closed parashardhapola closed 1 year ago

parashardhapola commented 1 year ago

Hi @jsxlei,

Fantastic work, and congratulations on the publication! 🎊

I'm trying to fully leverage the memory efficiency of SCALEX, and I wonder whether SCALEX can be used to train on a single large dataset (say, 1 million cells) in batches (say, 10,000 cells at a time). In that case there would be 100 online training events. Have you tried something like this in-house?
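The scheme proposed above, splitting 1 million cells into 100 updates of 10,000 cells each, is essentially iteration over random index slices. A minimal numpy sketch (the batch generator is illustrative, not part of the SCALEX API; each yielded index array would feed one model update):

```python
import numpy as np

def minibatches(n_cells, batch_size, rng=None):
    """Yield arrays of cell indices covering all n_cells once, in random order."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(n_cells)
    for start in range(0, n_cells, batch_size):
        yield order[start:start + batch_size]

# Example: 1,000,000 cells in batches of 10,000 -> 100 "online" training events.
n_updates = sum(1 for _ in minibatches(1_000_000, 10_000))
print(n_updates)  # 100
```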

We have built Scarf, a very memory-efficient pipeline, and SCALEX would be a perfect addition.

/PD

jsxlei commented 1 year ago

Hi Parashar,

Thank you! Actually, SCALEX does already train the model by splitting the data into batches; the default batch_size is 64, which makes SCALEX very memory-efficient during training. The most memory-consuming part is the preprocessing, which requires loading the whole dataset; there I chunk the data into pieces of 20,000 cells each. With this approach, SCALEX can handle up to 4 million cells with roughly 100 GB of CPU memory. I am also considering taking advantage of AnnData's backed mode in the future, which would greatly reduce memory usage.
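The chunked preprocessing described here, processing 20,000 cells at a time so the full matrix never has to be densified at once, can be sketched with plain numpy. The `normalize_in_chunks` function below is an illustrative stand-in for the actual SCALEX preprocessing, using simple library-size normalization; in a backed-mode setup, each slice would instead come from an AnnData opened with `anndata.read_h5ad(path, backed="r")`:

```python
import numpy as np

def normalize_in_chunks(X, chunk_size=20_000):
    """Library-size normalize each cell (row), loading chunk_size rows at a time."""
    out = np.empty_like(X, dtype=float)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size].astype(float)
        totals = chunk.sum(axis=1, keepdims=True)
        # Guard against all-zero cells to avoid division by zero.
        out[start:start + chunk_size] = chunk / np.maximum(totals, 1)
    return out

# Small demo: the chunked result matches the all-at-once computation.
X = np.random.default_rng(1).integers(0, 5, size=(50, 10))
full = X / np.maximum(X.sum(axis=1, keepdims=True), 1)
assert np.allclose(normalize_in_chunks(X, chunk_size=7), full)
```

The peak memory is then governed by `chunk_size` times the number of genes, rather than by the total cell count.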

I am very interested; please let me know how we could collaborate on integration with Scarf.

Thank you! Lei