jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

Missing observations for object data sets #567

Closed mabuodeh closed 5 years ago

mabuodeh commented 5 years ago

Hey!

First of all, thanks for this awesome library!

Second, I was looking around for support for missing observations, and came across this issue, as well as this mentioned in the documentation: "Missing value support was added in a manner that requires the least user thought. All one has to do is add numpy.nan to mark an entry as missing for numeric data sets, or the string 'nan' for string data sets."

However, my data set is composed of objects, not numbers or strings. So how should I go about adding a missing object observation?

Thanks in advance!

jmschrei commented 5 years ago

Howdy

Internally, pomegranate will convert your objects to a list of integers. You can do this yourself by defining some key mapping for each column independently. After you do this conversion from objects to integers, you can just add in numpy.nan when you don't have a value. The easiest way to do the conversion is something like this:

X = ... your numpy array of objects ...

# one keymap per column, mapping each unique value to an integer
keymaps = [{value: key for key, value in enumerate(numpy.unique(X[:, j]))} for j in range(X.shape[1])]

new_X = numpy.empty(X.shape)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        new_X[i, j] = keymaps[j][X[i, j]]

I didn't bug-test that code, but it should be something like that.
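For reference, here is a runnable variant of that sketch, with None standing in for a missing object before conversion (the data values here are made up; the conversion encodes each column independently and writes numpy.nan for missing entries):

```python
import numpy

# Toy object matrix; None marks a missing entry before conversion.
X = numpy.array([["red", "small"],
                 ["blue", "large"],
                 [None, "small"]], dtype=object)

# One value -> integer keymap per column, built from the observed values only.
keymaps = [
    {value: key for key, value in enumerate(sorted({x for x in X[:, j] if x is not None}))}
    for j in range(X.shape[1])
]

# Float dtype so numpy.nan can represent missing values.
new_X = numpy.empty(X.shape, dtype=float)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        new_X[i, j] = keymaps[j].get(X[i, j], numpy.nan)
```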

Does that make sense? Let me know if you have any other questions.

mabuodeh commented 5 years ago

Alright so I understand what you're saying, but when it comes to inference on samples that have missing values, how would I go about solving it?

I'll add a bit more detail about what I did so far.

Based on my understanding, I added a 'nan' observation to each state's distribution and, after training my model on similar sequences (say ABCD, ABCCD, ABCDCD, ...), I tried to predict the sequence of states for the sequence ACD. I'd want the output to be something like XXX, where the X's are the states that emitted the observations. However, the sequence of states would have no such states (though the log probability was pretty high).

I spent a bit of time looking over your answer, and I have a feeling it isn't meant for inference, or I may have not understood the main idea that well.

jmschrei commented 5 years ago

I'm not sure what the concern is. You should convert your data from objects to integers for input to the model. You can do this using https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html if you'd like. When the variable is missing, instead of using an integer, you should put np.nan in for that entry. It should end up looking something like this:

[[1, 2, 0, 3, nan, nan, 2],
 [0, nan, 1, 2, 0, 0, 1],
 [1, 0, 1, 0, 0, 1, 0]]

When you feed that into predict or predict_proba, it should replace the nan values with the predicted distribution.
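A sketch of that preprocessing with the LabelEncoder linked above (the data rows are made up; the encoder is fit on the observed values only, and the output is float so np.nan fits):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up categorical rows; None marks a missing entry.
raw = [["b", "c", "a", None, "c"],
       ["a", None, "b", "c", "a"]]

# Fit the encoder on the non-missing values only.
encoder = LabelEncoder()
encoder.fit([v for row in raw for v in row if v is not None])

# Encode to floats so missing entries can become np.nan.
X = np.array([[np.nan if v is None else float(encoder.transform([v])[0])
               for v in row] for row in raw])
```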

However, your example sounds like it's better suited for an HMM. An HMM solves the task of tagging each observation in a variable length sequence with a "hidden state", where there's an underlying transition structure between the hidden states. Bayesian networks are much better suited for fixed length observations.

mabuodeh commented 5 years ago

Alright I understand what you mean. And I am using an HMM already. I'll add the missing observations then, thanks for the information!

mabuodeh commented 5 years ago

I tried to implement it using the tutorial provided for HMMs, and it gave me an error.

I switched the chars to objects: X = [C,G,A,C,T,A,C,T,G,A,C,T,A,C,T,C,G,C,C,G,A,C,G,C,G,A,C,T,G,C,C,G,T,C,T,A,T,A,C,T,G,C,G,C,A,T,A,C,G,G,C]

I then used scikit learn to transform the sequence into a sequence of integers: [1 3 0 1 2 0 1 2 3 0 1 2 0 1 2 1 3 1 1 3 0 1 3 1 3 0 1 2 3 1 1 3 2 1 2 0 2 0 1 2 3 1 3 1 0 2 0 1 3 3 1]

where the classes_ are: [A C T G]

I then attempted to predict the sequence: hmm_predictions = model.predict(seq)

and the error was that '1' was not in the distribution. And that makes sense, since my observations are A, C, G, and T. Does this mean that I have to replace all my observations with integers from now on (the initial sequence, as well as all the sequences I'll be using to train the model)?

jmschrei commented 5 years ago

Yes, you'll need to convert from your characters to the integers in both the training and the inference.
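One way to keep the two consistent is to build the mapping once and reuse it for both training and inference sequences (a sketch; the particular integer assignments below are an assumption, they just have to match across calls):

```python
import numpy as np

# Fixed mapping shared by training and inference (assumed DNA alphabet).
keymap = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Encode a character sequence as floats, with 'nan' -> np.nan."""
    return np.array([np.nan if s == "nan" else float(keymap[s]) for s in seq])

train = encode(list("CGACT"))            # used to fit the model
query = encode(["A", "nan", "C", "T"])   # used for model.predict(...)
```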

mabuodeh commented 5 years ago

Alright, so I gave the encoder a try, and when I tried to add numpy.nan it gave me an error about mixing floating points with an int list (an integer array can't hold numpy.nan). Anyway, I just cloned the repo and made changes directly in hmm.py. I added these lines to check_input:

if isinstance(symbol, str) and symbol == 'nan':
    sequence_ndarray[i] = numpy.nan
elif isinstance(symbol, (int, float)) and numpy.isnan(symbol):
    sequence_ndarray[i] = numpy.nan
# sequence[i] is an object, has the missing_observation attribute, and it is True
elif (sequence[i] in keymap) and hasattr(sequence[i], 'missing_observation') and sequence[i].missing_observation:
    print('missing obs true')
    print(symbol, ' ', keymap[symbol])
    sequence_ndarray[i] = numpy.nan
elif sequence[i] in keymap:
    sequence_ndarray[i] = keymap[symbol]
else:
    raise ValueError("Symbol '{}' is not defined in a distribution"
        .format(symbol))

What it does is: after the string, int, and float checks, it checks whether the entry is an object with a 'missing_observation' attribute. If that attribute is True, the observation is treated as missing.

Since this is a sort of workaround I don't think a pull request makes sense, though I figured I'd share it anyway. Thanks for clarifying a few points that I misunderstood!
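A library-agnostic alternative to patching hmm.py is to convert such sentinel objects to numpy.nan before the data ever reaches the model (a sketch; the Obs class is a made-up stand-in for the domain objects, and missing_observation is the attribute from the comment above):

```python
import numpy as np

class Obs:
    """Minimal stand-in for the domain objects discussed above."""
    def __init__(self, value, missing_observation=False):
        self.value = value
        self.missing_observation = missing_observation

def to_numeric(sequence, keymap):
    """Map objects to integers up front, and missing ones to np.nan."""
    return np.array([
        np.nan if getattr(obs, 'missing_observation', False)
        else float(keymap[obs.value])
        for obs in sequence
    ])

keymap = {"A": 0, "B": 1, "C": 2}
seq = [Obs("A"), Obs("B", missing_observation=True), Obs("C")]
encoded = to_numeric(seq, keymap)
```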