jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

[Question] Fitting multivariate Markov Chain throws Index out of bounds error #1077

Closed: salpers closed this issue 2 months ago

salpers commented 8 months ago

Hey there,

I am trying to fit a Markov chain model to multivariate, categorical sequential data. After label-encoding my sequences as integers, I pad them with 0 so that they all have the same length. The resulting tensor has shape (932, 132, 3): 932 observations of length 132 (zero-padded) with 3 features per element.

However, I get an index-out-of-bounds error when I try to fit the model.

from pomegranate.markov_chain import MarkovChain

# data: integer tensor of shape (932, 132, 3), zero-padded as described above
model = MarkovChain(k=3)
model.fit(data)

File .../pomegranate/markov_chain.py:216, in MarkovChain.fit(self, X, sample_weight)
    193 def fit(self, X, sample_weight=None):
    194     """Fit the model to optionally weighted examples.
    195 
    196     This method will fit the provided distributions given the data and
   (...)
    213     self
    214     """
--> 216     self.summarize(X, sample_weight=sample_weight)
    217     self.from_summaries()
    218     return self

File .../pomegranate/markov_chain.py:276, in MarkovChain.summarize(self, X, sample_weight)
    274 for i in range(X.shape[1] - self.k):
    275     j = i + self.k + 1
--> 276     distribution.summarize(X[:, i:j], sample_weight=sample_weight)

File .../pomegranate/distributions/conditional_categorical.py:168, in ConditionalCategorical.summarize(self, X, sample_weight)
    165 strides = torch.tensor(self._xw_sum[j].stride(), device=X.device)
    166 X_ = torch.sum(X[:, :, j] * strides, dim=-1)
--> 168 self._xw_sum[j].view(-1).scatter_add_(0, X_, sample_weight[:,j])
    169 self._w_sum[j][:] = self._xw_sum[j].sum(dim=-1)

RuntimeError: index 21869 is out of bounds for dimension 0 with size 14520

I would appreciate it if you could help me with the issue or point out any mistakes in my approach.

salpers commented 8 months ago

I experimented with changing the data, but the issue is also reproducible with small random data:

import numpy as np
from pomegranate.markov_chain import MarkovChain

np.random.seed(137)
seq_data = np.random.randint(0, 10, (1,10,1))

model = MarkovChain(k=1)
model.fit(seq_data)

This throws:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[99], line 5
      2 seq_data = np.random.randint(0, 10, (1,6,1))
      4 model = MarkovChain(k = 1)
----> 5 model.fit(seq_data)

File /opt/conda/lib/python3.10/site-packages/pomegranate/markov_chain.py:216, in MarkovChain.fit(self, X, sample_weight)
    193 def fit(self, X, sample_weight=None):
    194     """Fit the model to optionally weighted examples.
    195 
    196     This method will fit the provided distributions given the data and
   (...)
    213     self
    214     """
--> 216     self.summarize(X, sample_weight=sample_weight)
    217     self.from_summaries()
    218     return self

File /opt/conda/lib/python3.10/site-packages/pomegranate/markov_chain.py:276, in MarkovChain.summarize(self, X, sample_weight)
    274 for i in range(X.shape[1] - self.k):
    275     j = i + self.k + 1
--> 276     distribution.summarize(X[:, i:j], sample_weight=sample_weight)

File /opt/conda/lib/python3.10/site-packages/pomegranate/distributions/conditional_categorical.py:168, in ConditionalCategorical.summarize(self, X, sample_weight)
    165 strides = torch.tensor(self._xw_sum[j].stride(), device=X.device)
    166 X_ = torch.sum(X[:, :, j] * strides, dim=-1)
--> 168 self._xw_sum[j].view(-1).scatter_add_(0, X_, sample_weight[:,j])
    169 self._w_sum[j][:] = self._xw_sum[j].sum(dim=-1)

RuntimeError: index 42 is out of bounds for dimension 0 with size 28

Koenig128 commented 6 months ago

Hi,

I got the same error. Have you been able to fix it in the meantime? Does anyone else have a suggestion?

I would really appreciate any help on this.

Thank you!

jmschrei commented 6 months ago

This should be fixed in v1.0.4. Please let me know if you encounter any other issues. In the future, if you run into similar problems, you can pass n_categories to MarkovChain, or build the list of distributions yourself (one Categorical followed by k ConditionalCategorical objects).
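
For anyone landing here with the same error, the following is a minimal pure-Python sketch of how this kind of failure can arise. It is not pomegranate's code. It only mimics what the traceback suggests: each window is turned into a flat index via strides and counted in a flattened table. The sequence values are hypothetical, chosen so the numbers match the second error message (index 42, size 28), and the idea that the table size was inferred from only part of the data is an assumption. Sizing the table from an explicit category count plays the role of passing n_categories.

```python
def flat_index(window, shape):
    """Row-major flat index of `window` in a table with the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    strides.reverse()
    return sum(value * stride for value, stride in zip(window, strides))


def count_transitions(seq, shape):
    """Count (previous, current) pairs in a flattened table of the given shape."""
    counts = [0] * (shape[0] * shape[1])
    for prev, cur in zip(seq, seq[1:]):
        # Raises IndexError when the table is too small for this pair.
        counts[flat_index((prev, cur), shape)] += 1
    return counts


# Hypothetical encoded sequence of length 6 (not the data from the issue).
seq = [3, 6, 0, 2, 5, 1]

# Assumption: the table shape is inferred from the first window (3, 6) only,
# giving 4 x 7 = 28 cells. The next window (6, 0) maps to 6 * 7 + 0 = 42.
inferred_shape = (seq[0] + 1, seq[1] + 1)
try:
    count_transitions(seq, inferred_shape)
    failed = False
except IndexError:
    failed = True

# Workaround: size the table from the full number of categories up front.
n_categories = max(seq) + 1
counts = count_transitions(seq, (n_categories, n_categories))
```

With the explicit category count, every window fits in the 7 x 7 table and all five transitions are counted.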