jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

[BUG] Initializing the HMM as part of fit fails for sequences with uneven length #1023

Open AKuederle opened 1 year ago

AKuederle commented 1 year ago

Describe the bug: If the model is not initialized, the fit method initializes it before running the actual fit. This works as long as all sequences passed in have the same length. However, if sequences of unequal length are provided (as supported by the fit method), the following line fails, because sequences of different lengths cannot be concatenated.

https://github.com/jmschrei/pomegranate/blob/c77f967a2b66505b42a4fc4063fcf1d26406a9a5/pomegranate/hmm/_base.py#L587

Not sure what the correct solution is here...

jmschrei commented 1 year ago

I'm able to reproduce this with the following script:

import torch

from pomegranate.distributions import Normal
from pomegranate.hmm import DenseHMM

X = [torch.randn(i // 5, i, 3) for i in range(20, 100)]

d = [Normal(), Normal()]
model = DenseHMM(d, verbose=True)
model.fit(X)

Looks like I only tested when the first dimension (batch size) is the same across all batches. I'll add in a fix today but I'm waiting for a bit more user feedback before releasing the first patch.

Here's a workaround: just reshape the data yourself and call _initialize. All it's doing is running k-means on a 2D matrix so you can break sequence boundaries without concern.

import torch

from pomegranate.distributions import Normal
from pomegranate.hmm import DenseHMM

X = [torch.randn(i // 5, i, 3) for i in range(20, 100)]
X_ = torch.cat([x.reshape(-1, 3) for x in X], dim=0).unsqueeze(0) # Add this

d = [Normal(), Normal()]
model = DenseHMM(d, verbose=True)
model._initialize(X_) # Add this too
model.fit(X)

Would you mind providing simple reproducing scripts in the future? That would help me debug.

AKuederle commented 1 year ago

Thanks for looking into this!

jmschrei commented 1 year ago

Did that code work for you?

majgah commented 9 months ago

Hi, I am experiencing the same problem. Taking the example from the documentation, how can I add a second sequence to X and then call model.predict(X)? I want the model to learn all of its parameters from all of the observed sequences.

import numpy

sequence = 'CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC'
X = numpy.array([[[['A', 'C', 'G', 'T'].index(char)] for char in sequence]])
X.shape

For example, adding the sequence below to X:

sequence1 = 'CGACTACTGACTACTCGCCGACGCGACTGCC'

jmschrei commented 7 months ago

If you want to process two sequences of different lengths, you'll need to run predict twice, each time on a tensor with a batch size of 1 and its own sequence length. Each method can only run on a tensor of a fixed size. fit can operate on tensors of different sizes only because I added a convenience utility inside it.
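A rough sketch of the "one call per sequence" pattern, in plain Python so it runs without the library installed. The helper names `encode`, `predict_each` and `report_shape` are made up for illustration and are not part of pomegranate; `report_shape` is only a stand-in for a fitted model's predict method and just reports the shape it was given.

```python
ALPHABET = ['A', 'C', 'G', 'T']


def encode(sequence):
    """Encode a DNA string as a nested list shaped (1, len(sequence), 1).

    The leading dimension is a batch of size 1 and the trailing one is a
    single categorical feature, mirroring the documentation example.
    """
    return [[[ALPHABET.index(char)] for char in sequence]]


def predict_each(predict, sequences):
    """Call `predict` once per sequence, each as its own batch of size 1."""
    return [predict(encode(sequence)) for sequence in sequences]


def report_shape(x):
    """Stand-in for a model's predict: return the (batch, length, dim) shape."""
    return (len(x), len(x[0]), len(x[0][0]))


sequence = 'CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC'
sequence1 = 'CGACTACTGACTACTCGCCGACGCGACTGCC'

# Each sequence is handled in a separate call, so the lengths never need to match.
shapes = predict_each(report_shape, [sequence, sequence1])
print(shapes)
```

With a fitted model, the stand-in would presumably be swapped for something like `lambda x: model.predict(numpy.array(x))`; that call is an assumption based on the documentation example above and has not been tested here. For training on both sequences, the same per-sequence arrays could be collected into a list and passed to fit, as in the reproducing script earlier in the thread.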