pmelchior / pygmmis

Gaussian mixture model for incomplete (missing or truncated) and noisy data
MIT License
98 stars 22 forks

High memory usage #17

Closed zecevicp closed 3 years ago

zecevicp commented 3 years ago

I'm trying to use PyGMMis with 5 clusters on a dataset of about 20k samples in 7 dimensions. The fit method eats up all the available memory on the machine (almost 60 GB). I wasn't able to call fit on a dataset with more than 6,000 samples.

Is this a known problem? Is PyGMMis capable of processing somewhat larger datasets? Is there a configuration option I am missing?

I am running this on Python 3.7.6 with something like:

import numpy as np
from pygmmis import GMM, fit

def selcallback(smpls):
    # True for samples inside the (mins, maxs) box in every dimension; mins/maxs defined elsewhere
    return np.all(np.all([smpls <= maxs, smpls >= mins], axis=0), axis=1)

gmm = GMM(K=5, D=7)
fit(gmm, samples, sel_callback=selcallback, maxiter=20, tol=0.001)

Thanks!

pmelchior commented 3 years ago

Hi Petar. I'm surprised by such a large memory allocation. I have run more than 100k samples in 3D on a laptop a while back, so I'm not sure why your problem explodes. The only things that scale with dimensionality are the covariance matrices of the components (or of the data, but you don't use those).
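
For scale, a rough back-of-the-envelope check (my numbers, not measured):

# raw footprint of the data and the component covariances, in float64
n_samples, n_dims, n_components = 20_000, 7, 5
print(n_samples * n_dims * 8 / 1e6, "MB for the samples")                  # ~1.1 MB
print(n_components * n_dims * n_dims * 8 / 1e3, "kB for the covariances")  # ~2 kB
# neither comes anywhere near 60 GB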

So my best guess from afar is that multiprocessing does something nasty. How many cores do you use on your machine? I have created a way that allows the cores to share the data without copying, but that piece of code hasn't been tested since Python 2.7 times. If it fails on a machine with lots of cores, you'd quickly run out of memory.

zecevicp commented 3 years ago

Hi, thanks for the quick reply. I'm using 12 cores. I can try fixing the number of cores to 1. I can also try running the code on Python 2.7. Any other ideas on how to work around or debug this?

pmelchior commented 3 years ago

First try with one core. If that helps, it's at least clear where the problem lies.

zecevicp commented 3 years ago

Hi, setting one core didn't help. I modified _mp_chunksize like this:

        # find how many components to distribute over available threads
        # cpu_count = multiprocessing.cpu_count()
        cpu_count = 1
        chunksize = max(1, self.K//cpu_count)
        n_chunks = min(cpu_count, self.K//chunksize)        
        return n_chunks, chunksize

Is that the proper way to do it?

pmelchior commented 3 years ago

This is just one of the places where the problem arises. The better option is to set the CPU count for the multiprocessing pool here. With that and your modification to _mp_chunksize, this should avoid the data copying (at least for the fit method).
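
Schematically, the change is just the argument of the Pool constructor (the exact line differs between versions, so treat this as a sketch rather than a patch):

import multiprocessing

# inside pygmmis.py, where fit() sets up its worker pool:
# pool = multiprocessing.Pool()             # default: one worker per available core
pool = multiprocessing.Pool(processes=1)    # restrict to a single worker for debugging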

zecevicp commented 3 years ago

Unfortunately, initializing the pool with Pool(1) doesn't help either. That single process takes up all the memory in the system.

pmelchior commented 3 years ago

I think I have found the problem. The imputation sample is generated internally and by default has 10 times as many samples as the original data. That reduces the noise on logL, but it's overkill here. I managed to get the code to run on 8 cores with as much data as you're using by setting oversampling=1 as an argument to fit.
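
For the call from your first post, that amounts to the same snippet with only oversampling added:

gmm = GMM(K=5, D=7)
fit(gmm, samples, sel_callback=selcallback, oversampling=1, maxiter=20, tol=0.001)
# oversampling=1: the imputation sample is about as large as the data (~20k draws)
# instead of the default 10x (~200k draws)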

zecevicp commented 3 years ago

That fixed it! Brilliant, thanks so much.