jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

refactored _log_probability function to avoid gil in multithreading #1008

Closed · ghost closed this 1 year ago

ghost commented 1 year ago

I noticed that when training a HiddenMarkovModel whose emissions are ExponentialDistribution objects wrapped in an IndependentComponentsDistribution, training was orders of magnitude slower than when using ExponentialDistribution directly. I was able to replicate the same behaviour with NormalDistribution as well, so I narrowed it down to IndependentComponentsDistribution.

The slowdown is noticeable only when using multithreading in fit (n_jobs > 1). I was able to narrow the cause down to the _log_probability function of IndependentComponentsDistribution. The following version of the function leads to the slow behaviour:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1:
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

But the following version is fast:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if True or self.cython == 1:  # notice the True here
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

This happens even though self.cython is always 1, so the else branch is never executed. Moreover, this variant is also slow:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1 or True:  # notice the True here, with the order reversed
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

I think that the branch inside the inner loop somehow makes it harder for Cython to compile efficient code. Here is a complete reproducible example that elicits the slow behaviour:

from pomegranate.distributions import ExponentialDistribution, IndependentComponentsDistribution
from pomegranate.base import State
from pomegranate.io import SequenceGenerator
from pomegranate.hmm import HiddenMarkovModel
import numpy as np

def get_model(A_d, B_d):
    model = HiddenMarkovModel()

    A_s = State(A_d, name="A")
    B_s = State(B_d, name="B")

    model.add_states([A_s, B_s])

    model.add_transition(
        a=model.start,
        b=A_s,
        probability=0.5,
    )
    model.add_transition(
        a=model.start,
        b=B_s,
        probability=0.5,
    )
    model.add_transition(
        a=A_s,
        b=A_s,
        probability=0.99,
    )
    model.add_transition(
        a=B_s,
        b=B_s,
        probability=0.99,
    )
    model.add_transition(
        a=A_s,
        b=B_s,
        probability=0.01,
    )
    model.add_transition(
        a=B_s,
        b=A_s,
        probability=0.01,
    )

    model.bake(verbose=True)

    return model

model = get_model(
    IndependentComponentsDistribution(
        [ExponentialDistribution(1), ExponentialDistribution(1), ExponentialDistribution(1)]
    ),
    IndependentComponentsDistribution(
        [ExponentialDistribution(100), ExponentialDistribution(100), ExponentialDistribution(100)]
    )
)
X = np.zeros((1000, 3))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)

model = get_model(
    ExponentialDistribution(1),
    ExponentialDistribution(100)
)
X = np.zeros((1000))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)
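
(As noted above, the slowdown only shows up with multithreading; a quick sanity check, sketched below under that assumption, is to repeat the first fit with n_jobs=1, which should train at normal speed.)

# Sketch: rerun the IndependentComponentsDistribution model single-threaded.
# If the slowdown comes from the GIL handling in the multithreaded path,
# this fit should be fast even with the original pomegranate code.
model = get_model(
    IndependentComponentsDistribution(
        [ExponentialDistribution(1), ExponentialDistribution(1), ExponentialDistribution(1)]
    ),
    IndependentComponentsDistribution(
        [ExponentialDistribution(100), ExponentialDistribution(100), ExponentialDistribution(100)]
    )
)
X = np.zeros((1000, 3))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=1)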

The first fit call is slow with the original pomegranate code:

[1] Improvement: 8114.942703745794  Time (s): 6.926
[2] Improvement: 0.0011873976327478886  Time (s): 8.598
[3] Improvement: 1.3969838619232178e-09 Time (s): 8.284
[4] Improvement: -6.426125764846802e-08 Time (s): 8.033
[5] Improvement: 6.472691893577576e-08  Time (s): 8.972
[6] Improvement: -6.472691893577576e-08 Time (s): 8.011
[7] Improvement: 6.472691893577576e-08  Time (s): 7.855
[8] Improvement: -6.472691893577576e-08 Time (s): 8.366
[9] Improvement: 6.472691893577576e-08  Time (s): 8.269
[10] Improvement: -6.472691893577576e-08    Time (s): 8.243
Total Training Improvement: 8114.943891080562
Total Training Time (s): 88.2151

The second fit call without IndependentComponentsDistribution is fast:

[1] Improvement: 8111.520942713367  Time (s): 0.1579
[2] Improvement: 1.6850003311410546 Time (s): 0.1475
[3] Improvement: 0.00018647708930075169 Time (s): 0.1554
[4] Improvement: 1.3271346688270569e-08 Time (s): 0.1509
[5] Improvement: -1.234002411365509e-08 Time (s): 0.1648
[6] Improvement: 2.3283064365386963e-10 Time (s): 0.1637
[7] Improvement: 0.0    Time (s): 0.1708
[8] Improvement: 0.0    Time (s): 0.1509
[9] Improvement: 0.0    Time (s): 0.1576
[10] Improvement: 0.0   Time (s): 0.1464
Total Training Improvement: 8113.206129522761
Total Training Time (s): 1.7545

With the modifications in this commit, the slow fit call becomes fast:

[1] Improvement: 8114.942703745794  Time (s): 0.1773
[2] Improvement: 0.0011873976327478886  Time (s): 0.1736
[3] Improvement: 1.3969838619232178e-09 Time (s): 0.1783
[4] Improvement: -6.426125764846802e-08 Time (s): 0.1667
[5] Improvement: 6.472691893577576e-08  Time (s): 0.1846
[6] Improvement: -6.472691893577576e-08 Time (s): 0.1884
[7] Improvement: 6.472691893577576e-08  Time (s): 0.1826
[8] Improvement: -6.472691893577576e-08 Time (s): 0.1638
[9] Improvement: 6.472691893577576e-08  Time (s): 0.1868
[10] Improvement: -6.472691893577576e-08    Time (s): 0.1695
Total Training Improvement: 8114.943891080562
Total Training Time (s): 1.9820
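
For reference, one way to carry out this refactor is to hoist the self.cython check out of the loops so that the nogil hot path never contains a with gil block. A minimal sketch of that shape is below; the actual changes in the commit may differ:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    # Branch once, outside the loops, so the Cython path runs with no
    # with-gil block anywhere in its hot loop.
    if self.cython == 1:
        for i in range(n):
            for j in range(self.d):
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
                log_probability[i] += logp * self.weights_ptr[j]
    else:
        # Pure-Python distributions need the GIL; acquire it once for the
        # whole loop instead of once per element.
        with gil:
            for i in range(n):
                for j in range(self.d):
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)
                    log_probability[i] += logp * self.weights_ptr[j]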
jmschrei commented 1 year ago

Thank you for your contribution. However, pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython, and so this PR is unfortunately out of date and will be closed.