jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

Training a HMM from samples (supervised training) #960

Closed teoML closed 1 year ago

teoML commented 2 years ago

Hi! First of all, thank you for developing this library! I want to train an HMM from given samples of observations and their corresponding labels. The labels are the hidden states that the HMM will have.

I have a dataset which looks like this (just a sample here):

   timestamp  sensor1   sensor2  sensor3  sensor4    action
           1    0.05       0.04    0.10      0.39       A1
           2    0.25       0.14    0.11      0.34       A2
           3    0.15       0.34    0.13      0.36       A3
        .......

As shown above, I have 4 sensor values for each timestamp, and in the annotated dataset I also have the action (A1-A4). For my supervised problem this means that each observation is a 4-dimensional feature vector annotated with an action label. I saw that pomegranate can create a model from given samples, so I tried running a supervised training procedure, but I am getting an error (ValueError: zero-dimensional arrays cannot be concatenated).

# import some libraries
import numpy as np
import pomegranate as pg

# each observation consists of the values of four different sensors: sensor1, sensor2, sensor3, sensor4
# we have three different states: A1, A2, A3
# an observation sequence is given below - each list element is a vector where each dimension corresponds to sensor1 - sensor4 respectively

obs_seq = np.array([[0.4, 0.32, 0.56, 0.7],[0.4, 0.82, 0.96, 0.47],[0.43, 0.12, 0.56, 0.27],[0.4, 0.9, 0.46, 0.1],[0.2, 0.32, 0.36, 0.1],[0.14, 0.267, 0.68, 0.57], [0.34, 0.762, 0.76, 0.73], [0.4, 0.22, 0.56, 0.47], [0.43, 0.12, 0.56, 0.27], [0.24, 0.19, 0.84, 0.1], [0.22, 0.32, 0.61, 0.7], [0.94, 0.234, 0.83, 0.77],
           [0.34, 0.52, 0.89, 0.4],[0.9, 0.72, 0.56, 0.17],[0.43, 0.12, 0.56, 0.27], [0.64, 0.69, 0.48, 0.1],[0.25, 0.362, 0.16, 0.6],[0.34, 0.214, 0.18, 0.67],
           [0.64, 0.72, 0.77, 0.1],[0.3, 0.62, 0.76, 0.37],[0.43, 0.12, 0.56, 0.27],[0.74, 0.52, 0.96, 0.1],[0.22, 0.342, 0.46, 0.5],[0.54, 0.63, 0.67, 0.27],
           [0.14, 0.38, 0.26, 0.5],[0.5, 0.52, 0.12, 0.657],[0.43, 0.12, 0.56, 0.27],[0.33, 0.26, 0.93, 0.1],[0.432, 0.32, 0.66, 0.3],[0.74, 0.07, 0.43, 0.47],
           [0.24, 0.22, 0.36, 0.6],[0.67, 0.32, 0.16, 0.26],[0.43, 0.12, 0.56, 0.27],[0.67, 0.22, 0.90, 0.1],[0.22, 0.314, 0.42, 0.2],[0.84, 0.17, 0.13, 0.67]])

obs_states = ["A1", "A3", "A1", "A1", "A1", "A3", 
              "A3", "A2", "A2", "A1", "A3", "A1",
              "A3", "A3", "A1", "A1", "A1", "A3", 
              "A2", "A2", "A1", "A1", "A3", "A1",
              "A2", "A3", "A1", "A1", "A1", "A3", 
              "A2", "A2", "A1", "A1", "A3", "A2",
              ]
states_names = ["A1", "A2", "A3"]

# build the Markov model from the samples

model = pg.HiddenMarkovModel.from_samples(pg.NormalDistribution,
                                          n_components = 3,
                                          state_names = states_names,
                                          X = obs_seq, 
                                          labels= obs_states,
                                          algorithm='labeled')

obs_seq represents one long sequence in which each vector has 4 dimensions holding the values measured by the sensors. The obs_states variable holds the label for each of the 4-dimensional vectors in obs_seq.

I also created a Google Colab notebook so that you can run the example yourself.

https://colab.research.google.com/drive/10ZwBef9SsF5I5i3SnPr4dyXwr8z8jdm1?usp=sharing

Thank you for your help!

ghada-source commented 2 years ago

I am working on the same kind of problem and I also run into this error. Did you find a solution?

jmschrei commented 2 years ago

Sorry you encountered issues. Multivariate data needs to have three dimensions, either as a single array with shape (n_samples, n_observations, n_dimensions) or as a list of 2D arrays where each array has shape (n_observations, n_dimensions). Even if you only have a single example, you need to either put the data in a list with a single element or pass a numpy array whose first dimension is 1. The same goes for the labels.

Here is the code I got to run:

obs_seq = np.array([[[0.4, 0.32, 0.56, 0.7],[0.4, 0.82, 0.96, 0.47],[0.43, 0.12, 0.56, 0.27],[0.4, 0.9, 0.46, 0.1],[0.2, 0.32, 0.36, 0.1],[0.14, 0.267, 0.68, 0.57], [0.34, 0.762, 0.76, 0.73], [0.4, 0.22, 0.56, 0.47], [0.43, 0.12, 0.56, 0.27], [0.24, 0.19, 0.84, 0.1], [0.22, 0.32, 0.61, 0.7], [0.94, 0.234, 0.83, 0.77],
           [0.34, 0.52, 0.89, 0.4],[0.9, 0.72, 0.56, 0.17],[0.43, 0.12, 0.56, 0.27], [0.64, 0.69, 0.48, 0.1],[0.25, 0.362, 0.16, 0.6],[0.34, 0.214, 0.18, 0.67],
           [0.64, 0.72, 0.77, 0.1],[0.3, 0.62, 0.76, 0.37],[0.43, 0.12, 0.56, 0.27],[0.74, 0.52, 0.96, 0.1],[0.22, 0.342, 0.46, 0.5],[0.54, 0.63, 0.67, 0.27],
           [0.14, 0.38, 0.26, 0.5],[0.5, 0.52, 0.12, 0.657],[0.43, 0.12, 0.56, 0.27],[0.33, 0.26, 0.93, 0.1],[0.432, 0.32, 0.66, 0.3],[0.74, 0.07, 0.43, 0.47],
           [0.24, 0.22, 0.36, 0.6],[0.67, 0.32, 0.16, 0.26],[0.43, 0.12, 0.56, 0.27],[0.67, 0.22, 0.90, 0.1],[0.22, 0.314, 0.42, 0.2],[0.84, 0.17, 0.13, 0.67]]])

obs_states = np.array([["A1", "A3", "A1", "A1", "A1", "A3", 
              "A3", "A2", "A2", "A1", "A3", "A1",
              "A3", "A3", "A1", "A1", "A1", "A3", 
              "A2", "A2", "A1", "A1", "A3", "A1",
              "A2", "A3", "A1", "A1", "A1", "A3", 
              "A2", "A2", "A1", "A1", "A3", "A2",
              ]])

ghada-source commented 2 years ago

What's the difference between n_samples, n_observations, and n_dimensions?

jmschrei commented 2 years ago

HMMs can be trained on one sequence or on multiple sequences. n_samples is the number of sequences and n_observations is the number of elements in the sequence. n_dimensions is the number of dimensions these elements have.
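
As a rough illustration of the shapes described above, here is a NumPy-only sketch; it does not call pomegranate at all. The array values and the names `single_seq`, `single_labels`, `ragged`, and `ragged_labels` are made up for the example and are not part of the library.

```python
import numpy as np

# One sequence of 5 observations, each a 4-dimensional sensor vector
# (hypothetical values, shape (n_observations, n_dimensions) = (5, 4)).
single_seq = np.array([
    [0.40, 0.32, 0.56, 0.70],
    [0.40, 0.82, 0.96, 0.47],
    [0.43, 0.12, 0.56, 0.27],
    [0.40, 0.90, 0.46, 0.10],
    [0.20, 0.32, 0.36, 0.10],
])
single_labels = np.array(["A1", "A3", "A1", "A1", "A1"])

# Add a leading axis so the data becomes
# (n_samples, n_observations, n_dimensions) with n_samples == 1.
X = single_seq[np.newaxis, :, :]
labels = single_labels[np.newaxis, :]

n_samples, n_observations, n_dimensions = X.shape
print(X.shape)       # (1, 5, 4)
print(labels.shape)  # (1, 5)

# Sequences of different lengths cannot be stacked into one array,
# so they are kept as a list of 2D arrays, one entry per sequence,
# with a matching list of label sequences.
ragged = [single_seq, single_seq[:3]]
ragged_labels = [single_labels, single_labels[:3]]
```

In both layouts, each label sequence has to be exactly as long as the observation sequence it annotates.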

teoML commented 2 years ago

@jmschrei, thank you for also mentioning training on multiple sequences. (I have a scenario where the same experiment is performed by different people, so the sensor measurements and the sequence of actions might differ.) I have two more questions.

First, each observation in my example consists of 4 sensor values. Can I somehow set the distribution and the value range for each of them (in my case all values of the 4-dimensional vector are in the range 0 to 1) so that the HMM takes them into account?

Second, how do I predict the states for a new sequence of sensor observations? I tested it by calling model.predict([[0.4, 0.32, 0.56, 0.7],[0.4, 0.82, 0.96, 0.47],[0.43, 0.12, 0.56, 0.27]]), and I think this is the right way, since the output was [1, 0, 0]. (By the way, how can I see the actual labels, and what is the ordering? My labels are "A1", "A2", "A3", not 0, 1, 2.)

jmschrei commented 1 year ago

Thank you for opening an issue. pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython (v1.0.0), and so all issues are being closed as they are likely out of date. Please re-open or start a new issue if a related issue is still present in the new codebase.