pmelchior / pygmmis

Gaussian mixture model for incomplete (missing or truncated) and noisy data
MIT License

Divide by zero warning? #8

Closed Jieyu-Wang closed 6 years ago

Jieyu-Wang commented 6 years ago

Hi Peter,

Thank you for developing this powerful tool. Unfortunately, I ran into an issue when applying it to my data; could you please take a look at it? My Python version is 3.6.2 and all my site packages are up to date.

gmm_logL(simu,20)

The function gmm_logL is:

import time

import numpy as np
import matplotlib.pyplot as plt
import pygmmis


def gmm_logL(simu, K=10):
    """Calculate the log-likelihood of GMM fits for different numbers of components K."""
    logL = []
    if isinstance(K, list):
        K_list = K
    else:
        K_list = list(np.arange(1, K + 1))
    for i in list(K_list):                   # iterate over a copy so removing entries doesn't skip values
        start = time.time()
        try:
            gmm = pygmmis.GMM(K=i, D=len(simu[0]))
            logl, U = pygmmis.fit(gmm, simu, w=1e-6)
            logL.append(logl)
        except (Exception, RuntimeWarning):  # warnings are only caught here if converted to errors
            print('There is an error with K = %d' % i)
            K_list.remove(i)                 # drop the failed K so K_list stays aligned with logL
            continue
        print("Time used for K = %d is %gs, min amplitude is %g"
              % (i, time.time() - start, np.min(gmm.amp)))
    plt.plot(K_list, logL, linewidth=5)
    plt.title('Log-likelihood of GMM fitting')
    plt.xlabel('$K$')
    plt.ylabel(r'log-likelihood $\ln U$')
    plt.show()

The data file for simu is attached: simulation100000.txt

You can simply use simu = np.loadtxt('simulation100000.txt') to load it.

I ran the code above in the IPython console, and the output is as follows:

D:\Anaconda\lib\site-packages\pygmmis.py:910: RuntimeWarning: divide by zero encountered in log
  log_S[H] = np.log(log_S[H])

D:\Anaconda\lib\site-packages\pygmmis.py:153: RuntimeWarning: invalid value encountered in add
  return np.log(np.exp(logX + c[c_shape]).sum(axis=axis)) - c

D:\Anaconda\lib\site-packages\pygmmis.py:766: RuntimeWarning: invalid value encountered in greater
  moved = np.flatnonzero(shift2 > shift_cutoff)

D:\Anaconda\lib\site-packages\numpy\linalg\linalg.py:1741: RuntimeWarning: invalid value encountered in slogdet
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)
D:\Anaconda\lib\site-packages\numpy\linalg\linalg.py:1741: RuntimeWarning: invalid value encountered in slogdet
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)
D:\Anaconda\lib\site-packages\numpy\linalg\linalg.py:1741: RuntimeWarning: invalid value encountered in slogdet
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)
D:\Anaconda\lib\site-packages\numpy\linalg\linalg.py:1741: RuntimeWarning: invalid value encountered in slogdet
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)

Above are the runtime warnings I got while running gmm_logL(simu, 20). It then stalled, apparently in an infinite loop, for more than two hours, so I pressed Ctrl+C to stop it. It didn't stop immediately, but only after printing the following:

^C
Process SpawnPoolWorker-7:
Process SpawnPoolWorker-6:
Process SpawnPoolWorker-5:
Process SpawnPoolWorker-3:
Process SpawnPoolWorker-2:
Process SpawnPoolWorker-1:
Process SpawnPoolWorker-4:
Process SpawnPoolWorker-8:
Traceback (most recent call last):
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "D:\Anaconda\lib\site-packages\parmap\parmap.py", line 116, in _func_star_many
    **func_items_args[3])
  File "D:\Anaconda\lib\site-packages\pygmmis.py", line 932, in _Esum
    dx = d_ - gmm.mean[k]
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 341, in get
    with self._rlock:
  File "D:\Anaconda\lib\multiprocessing\synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "D:\Anaconda\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "D:\Anaconda\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Anaconda\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "D:\Anaconda\lib\multiprocessing\queues.py", line 342, in get
    res = self._reader.recv_bytes()
  File "D:\Anaconda\lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "D:\Anaconda\lib\multiprocessing\connection.py", line 306, in _recv_bytes
    [ov.event], False, INFINITE)
KeyboardInterrupt

Thank you for your time and patience!

pmelchior commented 6 years ago

Hi Jieyu,

I reproduced your problem. The root cause is in the first warning: it tells us that some sample ends up with zero probability. This should be prevented by evaluating the probability of each sample only over its neighborhood H. In your case (the default configuration), that neighborhood comprises all samples, including some that have very low probability under any of the GMM components, so you get a numerical underflow. I suspect this happens because some components shrink very strongly.
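
Here is a minimal numpy illustration of the failure mode (not the pygmmis code path itself, just the arithmetic): once a probability underflows to exactly zero in float64, taking its log emits precisely the divide-by-zero warning you saw and returns -inf.

import numpy as np

# A probability far out in the tails: exp() underflows to exactly 0.0 in float64,
# and log(0) then triggers "RuntimeWarning: divide by zero encountered in log".
p = np.exp(np.array([-800.0]))
print(p)          # [0.]
print(np.log(p))  # [-inf], plus the divide-by-zero RuntimeWarning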

Use the argument cutoff=5 in fit, which limits the evaluation to 5-sigma neighborhoods of each component (where sigma refers to the component's covariance). That prevents the underflow by ignoring samples that are too far away from any component, and it speeds up the process considerably: I just ran the full simu with K=10 in 5 seconds.

Another option is increasing the minimum component covariance w to something larger. 1e-6 doesn't help for the range your data cover.
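
For reference, with your loop above either option is a one-line change to the fit call (the larger w value below is only illustrative; choose it relative to the scale of your data):

# option 1: restrict the evaluation to 5-sigma neighborhoods of each component
logl, U = pygmmis.fit(gmm, simu, w=1e-6, cutoff=5)

# option 2: a larger minimum component covariance (1e-3 is illustrative, not a recommendation)
logl, U = pygmmis.fit(gmm, simu, w=1e-3)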

Jieyu-Wang commented 6 years ago

Thanks for helping me out, and for the quick reply! It now works fine for my model.