[Question] How to properly preprocess data for Markov Chain?

RotatingGiraffe commented 6 months ago

Hi there,

I am currently trying to learn a Markov Chain from one singular sequence that I have. The fit-function requires 3d input, which leads to quite a lot of reshaping and preprocessing of a simple 1d list. I was wondering if maybe I had misunderstood something?

Here's what I am currently doing.

# The sequence I want to learn a Markov Chain from
my_sequence = np.array([0, 1, 0, 0, 0, 1, 2, 1, 2, 0])

# But the data has to be 3d and in the shape of (n_samples, k+1 , 1) where k is chain order
k = 2

# So I transform using a sliding window
window_size= k+1 # k Elements are input, 1 is output?

# Construct a sliding window
X = []
for i in range(0, len(my_sequence) - k, 1):
   # I take the relevant object + k previous ones and put them into a list
    X.append(my_sequence[i:i+window_size])

# Now reshape it so it is 3d!
X = np.array(X)
d1, d2 = X.shape
X = X.reshape(d1, d2, 1)
# Shape for this example is now (8, 3, 1)

# Now we can create the chain
mc = MarkovChain(k=k)
# and fit
mc.fit(X)

# When sampling I get the same shape as the input, although I would've just expected a 1d array that behaves like my original sequence

Is this intended usage/behaviour? I would appreciate some help and maybe an addition for a simple sequence in the tutorial, including potential preprocessing.

jmschrei commented 6 months ago

Hi @RotatingGiraffe.

Sorry that you're encountering some issues.

You don't need to construct the sliding window yourself. If you have a single sequence of length 10 with a single feature, you just need to reshape your data to have shape (1, 10, 1), where the dimensions mean having a single example, with a sequence length of 10, with only one feature. The reason these other dimensions exist is to have the model generalize to where someone has many sequences (and to model the start of a sequence correctly) or when the data are multivariate.

I agree that it's sort of annoying to have to reshape your data but I've found that it's better to write the API as broadly as possible so that people can fit a wide range of their own own use-cases into it, rather than support a bunch of different ways of calling the same models depending on the shape of your data. If X is your array with shape (10,) right now, you can just do X[None, :, None] to get data in the correct shape.

Let me know if you run into any other issues.

RotatingGiraffe commented 6 months ago

Thanks for you answer!

When trying your method without sliding windows, I get a RuntimeError. The model only sucessfully fits when k is len(X)-1, which is why I construced the sliding windows in the first place - it worked, although I wasn't really sure why :D

The error should be reproducable with this code:

# The sequence I want to learn a Markov Chain from
my_sequence = np.array([0, 1, 0, 0, 0, 1, 2, 1, 2, 0])

k = 2

# Reshape
X = my_sequence[None, :, None]
# Shape for this example is now (1, 10, 1)

# Now we can create the chain
mc = MarkovChain(k=k)
# and fit
mc.fit(X)

The error message is

{
"name": "RuntimeError",
"message": "index 2 is out of bounds for dimension 0 with size 2",
"stack": "
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[17], line 14
     12 mc = MarkovChain(k=k)
     13 # and fit
---> 14 mc.fit(X)

File ~/miniconda3/envs/data_science/lib/python3.11/site-packages/pomegranate/markov_chain.py:216, in MarkovChain.fit(self, X, sample_weight)
    193 def fit(self, X, sample_weight=None):
    194 \t \"\"\"Fit the model to optionally weighted examples.
    195 
    196 \tThis method will fit the provided distributions given the data and
   (...)
    213 \tself
    214 \t \"\"\"
--> 216 \t self.summarize(X, sample_weight=sample_weight)
    217 \t self.from_summaries()
    218 \t return self

File ~/miniconda3/envs/data_science/lib/python3.11/site-packages/pomegranate/markov_chain.py:276, in MarkovChain.summarize(self, X, sample_weight)
    274 for i in range(X.shape[1] - self.k):
    275 \t j = i + self.k + 1
--> 276 \t distribution.summarize(X[:, i:j], sample_weight=sample_weight)

File ~/miniconda3/envs/data_science/lib/python3.11/site-packages/pomegranate/distributions/conditional_categorical.py:168, in ConditionalCategorical.summarize(self, X, sample_weight)
    165 strides = torch.tensor(self._xw_sum[j].stride(), device=X.device)
    166 X_ = torch.sum(X[:, :, j] * strides, dim=-1)
--> 168 self._xw_sum[j].view(-1).scatter_add_(0, X_, sample_weight[:,j])
    169 self._w_sum[j][:] = self._xw_sum[j].sum(dim=-1)

RuntimeError: index 2 is out of bounds for dimension 0 with size 2"
}

Weirdly, I don't get this error when k is len-1, so 9 in this example. So with the following code, fitting is successfull.

# The sequence I want to learn a Markov Chain from
my_sequence = np.array([0, 1, 0, 0, 0, 1, 2, 1, 2, 0])

k = 9

# Reshape
X = my_sequence[None, :, None]
# Shape for this example is now (1, 10, 1)

# Now we can create the chain
mc = MarkovChain(k=k)
# and fit
mc.fit(X)

Maybe this is a bug?

Package info from conda list: Name: pomegranate Version: 1.0.0 Build: pyhd8ed1ab_1 Channel: conda-forge

jmschrei commented 6 months ago

Try getting the latest code, v1.0.4. I think I just resolved this issue for someone else.

RotatingGiraffe commented 6 months ago

Installing v.1.0.4 directly from the git repository resolved the issue. (conda-forge is still on v.1.0.0)

Thank you for your help!

jkleckner commented 6 months ago

(conda-forge is still on v.1.0.0)

It would be good to tag v1.0.4 so that conda-forge can pick it up. This will likely hit other people.

jmschrei / pomegranate

[Question] How to properly preprocess data for Markov Chain? #1082