jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

[BUG] n_jobs #928

Closed hammondm closed 3 years ago


hammondm commented 3 years ago

Hi

I've got a set of HiddenMarkovModels with initial settings that I change with fit().

When I pass n_jobs as an argument to fit(), CPU usage tops out around 210% even though I have more cores available.

I've tried different values for n_jobs, but I see the same behavior.

I've confirmed this on a Mac and with Docker on Ubuntu; the behavior is the same on both machines.

Any ideas?

mike h.

jmschrei commented 3 years ago

Hi Mike

I don't understand what the issue is. Can you please provide more details?

hammondm commented 3 years ago

Hi.

Sorry to be opaque. I'm appending the code below. This is basically a demo of how HMMs plus vector quantization doesn't work very well for speech recognition. It uses the speech commands dataset.

The problem is that although I have 16 cores available, top never shows usage above 210-215%. The question is why.

Interestingly, the next demo in the series uses Gaussian multivariate models. That runs very slowly, but top shows it going up to 450%.

Does that make sense?

mike h.

import os,librosa
from scipy.io import wavfile
from sklearn.cluster import KMeans
import numpy as np
import pomegranate as p

order = 6
wlength = 130
clusters = 15
numtrain = 150

digits = [
    'zero','one','two','three','four',
    'five','six','seven','eight','nine'
]

where = '/mhdata/commands/'
#where = '/Users/hammond/Desktop/commands/'

#create stored digits
allscores = []
filelist = []
for digit in digits:
    digitset = []
    files = os.listdir(where+digit)
    filelist.append(files)
    for f in files:
        try:
            fs,w = wavfile.read(where + digit + '/' + f)
            w = w.astype(float)
            cur = 0
            res = []
            while cur+wlength <= len(w):
                lpc = librosa.lpc(w[cur:cur+wlength],order=order)
                res.append(lpc)
                cur += wlength
            res = np.array(res)
            digitset.append(res)
        except Exception:
            print(f'error, skipping: {digit}/{f}')
        if len(digitset) == numtrain+10: break
    allscores.append(digitset)

#extract training items
train = []
for score in allscores:
    for digit in score[:numtrain]:
        train.append(digit)
train = np.vstack(train)

#use k-means to make clusters
print('clustering...')
km = KMeans(init='random',n_clusters=clusters)
km.fit(train)

#convert everything to VQ codes
allcodes = []
for score in allscores:
    digitset = []
    for digit in score:
        code = km.predict(digit)
        digitset.append(code)
    allcodes.append(digitset)

#make linear HMMs
print('creating HMMs...')
segments = np.array([4,3,2,3,3,4,4,5,3,4,3])
lengths = segments*3 + 2
clusterprob = 1/clusters
dist = {i:clusterprob for i in range(clusters)}
hmms = []
for i in range(10):
    states = lengths[i]
    m = p.HiddenMarkovModel('d' + str(i))
    #states
    statelist = []
    for s in range(states):
        d = p.DiscreteDistribution(dist.copy())
        st = p.State(d,name='s' + str(s))
        statelist.append(st)
    m.add_states(statelist)
    #start prob
    m.add_transition(m.start,statelist[0],1.0)
    #final state
    m.add_transition(statelist[-1],m.end,0.5)
    #loop transitions
    for state in statelist:
        m.add_transition(state,state,0.5)
    #sequential transitions
    for j in range(len(statelist)-1):
        m.add_transition(statelist[j],statelist[j+1],0.5)
    m.bake()
    hmms.append(m)

#train HMMs
print('training...')
for i in range(10):
    print(i)
    trainset = allcodes[i][:numtrain]
    hmm = hmms[i]
    hmm.fit(trainset,n_jobs=-1)

#test HMMs
print('testing...')
total = 0
for i in range(10):
    testset = allcodes[i][numtrain:]
    for testitem in testset:
        allres = []
        for hmm in hmms:
            res = hmm.probability(testitem)
            allres.append(res)
        allres = np.array(allres)
        idx = allres.argmax()
        if idx == i: total += 1

print(f'Correct: {total}/100')

jmschrei commented 3 years ago

That's weird. I usually see a linear increase in speedup with the number of cores. Two things come to mind: (1) have you checked that there are many training samples? The parallelization is done across examples, so if there are only, say, 4 examples, you might not get more than a 2x speedup. (2) How fast is each iteration right now? If the sequences are short, it can be faster to align each sequence to the model than to do the bookkeeping that needs to happen on the main thread, so if each iteration of EM is very fast you might not get much further speed gain.
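The per-example parallelism described above can be sketched with a toy map over sequences (a stand-in for illustration, not pomegranate's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(seq):
    # stand-in for the per-sequence work: align one sequence to the model
    return sum(seq)

# four short training sequences
sequences = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]

# work is farmed out per training sequence; with only a handful of short
# sequences, the scheduling and bookkeeping on the main thread dominate
# and extra workers mostly sit idle
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, sequences))

# the results are combined back on the main thread
total = sum(summaries)
```

The point of the sketch: the unit of parallel work is one sequence, so short sequences mean tiny work units and the fixed serial overhead caps the total CPU utilization.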

hammondm commented 3 years ago

Hi Jacob

I tried increasing the number of items from around 100 to around 1000 and no change, still hovering around 220%.

These are usually 6th-order LPC matrices, so each training item is something like 6x160.

Could it be that 1000 is still too few to kick in more parallelism?

mike h.

jmschrei commented 3 years ago

I don't have a good sense without seeing the data for how big it needs to be. I think the more important thing to look at is the time per EM iteration. If you set verbose=True, how long is each one taking?

hammondm commented 3 years ago

Hi Jacob

I ran it with verbose=True and got this:

...
[528] Improvement: 1.2360032997094095e-09   Time (s): 0.02792
[529] Improvement: 1.2014425010420382e-09   Time (s): 0.02759
[530] Improvement: 1.1741576599888504e-09   Time (s): 0.02773
[531] Improvement: 1.1459633242338896e-09   Time (s): 0.02762
[532] Improvement: 1.1041265679523349e-09   Time (s): 0.02813
[533] Improvement: 1.0795702110044658e-09   Time (s): 0.02693
[534] Improvement: 1.0504663805477321e-09   Time (s): 0.02788
[535] Improvement: 1.0177245712839067e-09   Time (s): 0.02777
[536] Improvement: 9.968061931431293e-10    Time (s): 0.02801
Total Training Improvement: 5395.212353176502
Total Training Time (s): 15.1980

What's your sense? Is that too little time per item to trigger more parallelism?

Incidentally, I'm running this on the "speech commands" dataset, which is publicly available at: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

mike h

jmschrei commented 3 years ago

Oh, yeah, that seems entirely too fast to benefit from much parallelization. Generally, you'll only start seeing real speed gains once the time per iteration is over one second. In this particular case, I think you'll have more luck setting the stop threshold to something above 1e-10; it's probably not necessary to train out that far. Given that your total improvement is ~5395, I don't think adding 1e-9 to that matters. Maybe even set the threshold to 0.1 or something.
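Concretely, the suggestion amounts to something like hmm.fit(trainset, n_jobs=-1, stop_threshold=0.1, verbose=True) (assuming the 0.x HiddenMarkovModel.fit signature). The effect of raising the threshold can be sketched with a toy EM loop over a list of per-iteration improvements:

```python
def run_em(improvements, stop_threshold):
    # mimic pomegranate's convergence check: stop once the per-iteration
    # improvement drops below stop_threshold
    iterations, total = 0, 0.0
    for imp in improvements:
        if imp < stop_threshold:
            break
        total += imp
        iterations += 1
    return iterations, total

# with stop_threshold=0.1, the long tail of ~1e-9 improvements is skipped
# while essentially all of the total improvement is kept
iters, total = run_em([5000.0, 300.0, 90.0, 5.0, 0.25, 1e-9, 1e-10], 0.1)
```

In the log above, most of the 536 iterations were spent accumulating improvements around 1e-9 against a total of ~5395, so a higher threshold should cut training time dramatically without changing the fitted model in any meaningful way.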

hammondm commented 3 years ago

Ah, understood. Thanks!