`2022-06-23 09:25:59.298713: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 19113388800 exceeds 10% of free system memory.`

bschilder commented 2 years ago

I'm currently trying to train GenoCAE on a datasets of 15 million SNPs across 67 individuals, but seem to be running into memory issues despite the fact that I'm using a AMD Threadripper workstations with 252 GB of memory and 64 cores (128 threads).

I suspect this may be due to the large number of SNPs I'm including, since the example data (which runs fine) only contains 9,259 SNPS, and in the original paper 161k were used.

2,067 individuals typed at 160,858

From my limited experience with these models, the number of input features drastically affect memory usage (much moreso than sample size). So I think my first step will be to filter the number of variants I'm training the model on based on some of the guidelines provided in the paper:

Remove sex chromosomes
Set missing genotypes "to the most frequent value per SNP so as to avoid their influence over dimensionality reduction results".
Remove SNPS with MAF <1%.
Perform LD pruning by "removing one of each pair of SNPs in windows of 1.0 centimorgan that had an allelic R2 value greater than 0.2." Though eventually I'd like to find a way to avoid this last step because I'm interested in identifying causal variants.

cnettel commented 2 years ago

I'm sorry for not following up on your earlier issues, but I can comment briefly on this one. Yes, you would most likely want to filter your SNP set. Depending on your intended usage modes, LD pruning can make sense, MAF pruning deifnitely can (because, after all, any SNP that is almost fixed adds limited information, except for the occasional carrier). The TensorFlow memory usage should be agnostic of population size, while the Python process allocates a buffer for all the data at one point. We're working on an internal version that does away with that, but that's not ready for public release. And, again, your actual error is most likely caused by the size of the instantiated network, and that's only dependent on batch size and the number of SNPs. Hence, decreasing batch size can be another option, in combination with filtering the SNP set. Internally, we already have an implementation that can distribute the batch across several TensorFlow processors, including on separata machines.

Again, depending on your usage mode, you might also want to try out doing just a subregion, e.g. (part of) a chromosome, to see what kind of results you get. For overall population structure, this would not be the way to go, but we've sometimes done such comparisons to understand the difference in genotype reconstruction ability depending on what kind of data we train the model on. With such a small training set, I think the benefits of using a very high number of SNPs will be limited anyway.

In short, we are aware of scaling issues and we are working on addressing them, but here and now I would suggest looking into a reduced SNP set, possibly based on criteria similar to what we used in the paper, and/or reducing the batch size. The SNP set used in the paper was, however, mainly chosen in order to do a fair comparison against PCA, i.e. filtering according to best practices for PCA and then see how well our approach could do, even when given such data.

bschilder commented 2 years ago

Thanks so much for the detailed feedback @cnettel ! This all makes quite a bit of sense to me.

I'll tinker with the SNP filtering based on your suggestions. And let me know if you need a tester for the new version of GenoCAE, I'd be happy to give it a go!

bschilder commented 2 years ago

Reducing my variants down to 150k now works great, no memory issues and completes 20 epochs within minutes. I'll keep playing around with the exact filtering strategy to figure out what works best.

Thanks again!

kausmees / GenoCAE

`2022-06-23 09:25:59.298713: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 19113388800 exceeds 10% of free system memory.` #33