**Closed.** zecevicp closed this issue 3 years ago.
Hi Petar. I'm surprised by such a large memory allocation. I have run more than 100k samples in 3D on a laptop a while back, so I'm not sure why your problem explodes. The only thing that scales with dimensionality is the covariance matrix of each component (or of the data, but you don't use those). So, my best guess from afar is that multiprocessing does something nasty. How many cores do you use on your machine? I have created a way that allows the cores to share the data without copying, but that piece of code hasn't been tested since Python 2.7 times. If it fails on a machine with lots of cores, you'd quickly run out of memory.
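(As an aside on sharing data across workers without copying: on Python 3.8+ the standard library provides `multiprocessing.shared_memory` for exactly this. The sketch below shows that stdlib mechanism, not pyGMMis's own Python-2.7-era code.)

```python
from multiprocessing import shared_memory  # requires Python 3.8+

# create a named shared block; any process can attach to it by name
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:4] = b"data"

# a second handle attaches to the same block without copying the bytes
view = shared_memory.SharedMemory(name=shm.name)
assert bytes(view.buf[:4]) == b"data"

view.close()
shm.close()
shm.unlink()  # release the block once all handles are closed
```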
Hi, thanks for the quick reply. I'm using 12 cores. I can try fixing the number of cores to 1. I can also try running the code on Python 2.7. Any other ideas on how to work around or debug this?
First try with one core. If that helps, it's at least clear where the problem lies.
Hi, setting one core didn't help. I modified `_mp_chunksize` like this:

```python
# find how many components to distribute over available threads
# cpu_count = multiprocessing.cpu_count()
cpu_count = 1
chunksize = max(1, self.K // cpu_count)
n_chunks = min(cpu_count, self.K // chunksize)
return n_chunks, chunksize
```

Is that the proper way to do it?
This is just one of the cases where the problem arises. The better option is to set the CPU count for the multiprocessing pool here. With that and your modification to `_mp_chunksize`, this should avoid the data copying (at least for the `fit` method).
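(To illustrate the suggested change in generic terms: pin the pool to a single worker where it is constructed. This is a self-contained sketch, not the pyGMMis source, where the pool is created internally.)

```python
import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    # a single-process pool means no extra workers are forked,
    # so the data is not duplicated across processes
    with multiprocessing.Pool(processes=1) as pool:
        results = pool.map(work, range(5))
    assert results == [0, 1, 4, 9, 16]
```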
Unfortunately, initializing the pool with `Pool(1)` doesn't help either. That single process takes all the memory in the system.
I think I have found the problem. The imputation sample is generated internally and by default has 10 times as many samples as the original data. That reduces the noise on logL, but it's overkill here. I managed to get the code to run on 8 cores with as much data as you're using by setting `oversampling=1` as an argument to `fit`.
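(A back-of-envelope illustration of why this matters at 20k samples; the exact internal array shapes are an assumption for illustration, not taken from the pyGMMis source.)

```python
N, K = 20_000, 5  # samples and components from the report above

def imputation_samples(oversampling):
    # the internal imputation sample holds oversampling * N points,
    # so every per-sample quantity (e.g. a K x N responsibility
    # table) grows by the same factor
    return oversampling * N

assert imputation_samples(10) == 200_000  # the default: 10x the data
assert imputation_samples(1) == 20_000    # the workaround discussed here
```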
That fixed it! Brilliant, thanks so much.
I'm trying to use PyGMMis with 5 clusters on a dataset with about 20k samples of 7 dimensions. The result is that the `fit` method eats up all the available memory on the machine (almost 60GB). I wasn't able to call `fit` on a dataset with more than 6000 samples. Is this a known problem? Is PyGMMis capable of processing somewhat larger datasets? Is there a configuration option I am missing?
I am running this on Python 3.7.6 with something like:
Thanks!