jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

refactored _log_probability function to avoid gil in multithreading #1008

Closed · ghost closed this 1 year ago

ghost commented 1 year ago

I noticed that when training a HiddenMarkovModel whose emissions are ExponentialDistribution objects wrapped in an IndependentComponentsDistribution, training was orders of magnitude slower than when using ExponentialDistribution directly. I was able to replicate the same behaviour with NormalDistribution as well, so I narrowed it down to IndependentComponentsDistribution.

The slowdown is noticeable only when using multithreading in fit (n_jobs > 1). I was able to narrow the cause down to the _log_probability function of IndependentComponentsDistribution. The following version of the function leads to the slow behaviour:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1:
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

But the following version is fast:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if True or self.cython == 1:  # notice the True here
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

This happens even though self.cython is always 1, so the else branch is never executed. Moreover, this variant is also slow:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1 or True:  # notice the True here, with the order reversed
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

I think that the branch inside the inner loop somehow makes it harder for Cython to compile efficient code. Here is a complete reproducible example that elicits the slow behaviour:

from pomegranate.distributions import ExponentialDistribution, IndependentComponentsDistribution
from pomegranate.base import State
from pomegranate.io import SequenceGenerator
from pomegranate.hmm import HiddenMarkovModel
import numpy as np

def get_model(A_d, B_d):
    model = HiddenMarkovModel()

    A_s = State(A_d, name="A")
    B_s = State(B_d, name="B")

    model.add_states([A_s, B_s])

    model.add_transition(
        a=model.start,
        b=A_s,
        probability=0.5,
    )
    model.add_transition(
        a=model.start,
        b=B_s,
        probability=0.5,
    )
    model.add_transition(
        a=A_s,
        b=A_s,
        probability=0.99,
    )
    model.add_transition(
        a=B_s,
        b=B_s,
        probability=0.99,
    )
    model.add_transition(
        a=A_s,
        b=B_s,
        probability=0.01,
    )
    model.add_transition(
        a=B_s,
        b=A_s,
        probability=0.01,
    )

    model.bake(verbose=True)

    return model

model = get_model(
    IndependentComponentsDistribution(
        [ExponentialDistribution(1), ExponentialDistribution(1), ExponentialDistribution(1)]
    ),
    IndependentComponentsDistribution(
        [ExponentialDistribution(100), ExponentialDistribution(100), ExponentialDistribution(100)]
    )
)
X = np.zeros((1000, 3))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)

model = get_model(
    ExponentialDistribution(1),
    ExponentialDistribution(100)
)
X = np.zeros((1000))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)
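
(As noted above, the slowdown only shows up with multithreading; a quick sanity check, sketched below under that assumption, is to repeat the first fit with n_jobs=1, which should train at normal speed.)

# Sketch: rerun the IndependentComponentsDistribution model single-threaded.
# If the slowdown comes from the GIL handling in the multithreaded path,
# this fit should be fast even with the original pomegranate code.
model = get_model(
    IndependentComponentsDistribution(
        [ExponentialDistribution(1), ExponentialDistribution(1), ExponentialDistribution(1)]
    ),
    IndependentComponentsDistribution(
        [ExponentialDistribution(100), ExponentialDistribution(100), ExponentialDistribution(100)]
    )
)
X = np.zeros((1000, 3))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=1)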

The first fit call is slow with the original pomegranate code:

[1] Improvement: 8114.942703745794  Time (s): 6.926
[2] Improvement: 0.0011873976327478886  Time (s): 8.598
[3] Improvement: 1.3969838619232178e-09 Time (s): 8.284
[4] Improvement: -6.426125764846802e-08 Time (s): 8.033
[5] Improvement: 6.472691893577576e-08  Time (s): 8.972
[6] Improvement: -6.472691893577576e-08 Time (s): 8.011
[7] Improvement: 6.472691893577576e-08  Time (s): 7.855
[8] Improvement: -6.472691893577576e-08 Time (s): 8.366
[9] Improvement: 6.472691893577576e-08  Time (s): 8.269
[10] Improvement: -6.472691893577576e-08    Time (s): 8.243
Total Training Improvement: 8114.943891080562
Total Training Time (s): 88.2151

The second fit call without IndependentComponentsDistribution is fast:

[1] Improvement: 8111.520942713367  Time (s): 0.1579
[2] Improvement: 1.6850003311410546 Time (s): 0.1475
[3] Improvement: 0.00018647708930075169 Time (s): 0.1554
[4] Improvement: 1.3271346688270569e-08 Time (s): 0.1509
[5] Improvement: -1.234002411365509e-08 Time (s): 0.1648
[6] Improvement: 2.3283064365386963e-10 Time (s): 0.1637
[7] Improvement: 0.0    Time (s): 0.1708
[8] Improvement: 0.0    Time (s): 0.1509
[9] Improvement: 0.0    Time (s): 0.1576
[10] Improvement: 0.0   Time (s): 0.1464
Total Training Improvement: 8113.206129522761
Total Training Time (s): 1.7545

With the modifications in this commit, the slow fit call becomes fast:

[1] Improvement: 8114.942703745794  Time (s): 0.1773
[2] Improvement: 0.0011873976327478886  Time (s): 0.1736
[3] Improvement: 1.3969838619232178e-09 Time (s): 0.1783
[4] Improvement: -6.426125764846802e-08 Time (s): 0.1667
[5] Improvement: 6.472691893577576e-08  Time (s): 0.1846
[6] Improvement: -6.472691893577576e-08 Time (s): 0.1884
[7] Improvement: 6.472691893577576e-08  Time (s): 0.1826
[8] Improvement: -6.472691893577576e-08 Time (s): 0.1638
[9] Improvement: 6.472691893577576e-08  Time (s): 0.1868
[10] Improvement: -6.472691893577576e-08    Time (s): 0.1695
Total Training Improvement: 8114.943891080562
Total Training Time (s): 1.9820
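
For reference, one way to carry out this refactor is to hoist the self.cython check out of the loops so that the nogil hot path never contains a with gil block. A minimal sketch of that shape is below; the actual changes in the commit may differ:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    # Branch once, outside the loops, so the Cython path runs with no
    # with-gil block anywhere in its hot loop.
    if self.cython == 1:
        for i in range(n):
            for j in range(self.d):
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
                log_probability[i] += logp * self.weights_ptr[j]
    else:
        # Pure-Python distributions need the GIL; acquire it once for the
        # whole loop instead of once per element.
        with gil:
            for i in range(n):
                for j in range(self.d):
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)
                    log_probability[i] += logp * self.weights_ptr[j]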
jmschrei commented 1 year ago

Thank you for your contribution. However, pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython, and so this PR is unfortunately out of date and will be closed.