mlpack / mlpack

mlpack: a fast, header-only C++ machine learning library
https://www.mlpack.org/

Make models more accessible in Python #1709

Closed rcurtin closed 3 years ago

rcurtin commented 5 years ago

This issue has to do with mlpack's Python bindings. Ideally, if you're interested in solving this, you should be familiar with how these bindings are used from Python, and have some knowledge of the automatic bindings system that generates these Python bindings. If you don't but you're still interested in this, consider looking at the Python quickstart and the automatic bindings documentation first. Also, this is an open-ended issue---I don't know the right way to solve it, and some exploration of possibilities would be needed.

One of the major drawbacks of the Python bindings is that when we build a machine learning model, it's not really inspectable by a Python user. In fact if I run the following code:

>>> import numpy as np
>>> from mlpack import logistic_regression
>>> x = np.random.rand(10, 10)
>>> y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
>>> output = logistic_regression(training=x, labels=y, verbose=True)

then the output logistic regression model (accessed with output['output_model']) is not an easy-to-understand model but instead a Python object of type LogisticRegressionModel, which internally just holds a pointer to the C++ memory that represents the mlpack::regression::LogisticRegression class. A Python user can't do anything with this except pass it to another call to logistic_regression() or pickle it for later use:

>>> import pickle
>>> pickle.dumps(output['output_model'])

If you take a look at the generated logistic_regression.pyx file in the src/mlpack/bindings/python/mlpack/ directory under your build directory, you can see that the model classes in Python (like LogisticRegressionModel) provide __getstate__() and __setstate__() methods for pickling, and these in turn call SerializeIn() and SerializeOut(). Those functions use boost::serialization to read and write the C++ model in a binary format (perfect for pickling).
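For illustration, here is a highly simplified sketch of that pattern in plain Python; the class and method names mirror the description above, but the stand-in byte string and helper attribute are assumptions for the example, not the actual generated Cython code (which wraps a pointer to C++ memory and calls SerializeOut()/SerializeIn()):

import pickle

# Simplified, hypothetical illustration of the pickling pattern described
# above; the real generated .pyx code serializes the wrapped C++ model via
# SerializeOut()/SerializeIn() instead of the stand-in bytes used here.
class LogisticRegressionModel(object):
    def __init__(self):
        self._cpp_state = b''  # stand-in for the serialized C++ model

    def __getstate__(self):
        # In the generated code, SerializeOut() produces a binary
        # boost::serialization archive of the underlying C++ model.
        return {'serialized': self._cpp_state}

    def __setstate__(self, state):
        # In the generated code, SerializeIn() rebuilds the C++ model
        # from that binary archive.
        self._cpp_state = state['serialized']

model = LogisticRegressionModel()
model._cpp_state = b'\x00\x01\x02'  # pretend this came from SerializeOut()
restored = pickle.loads(pickle.dumps(model))
assert restored._cpp_state == model._cpp_state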

But as you can see, that's not very useful for people who might want to know what the parameters of their model are. (And the whole impetus for this issue was #1703, where I wanted to do exactly that.) Now, the logistic regression model is actually really simple---it just holds a vector of parameters. So it would be really nice if I could do something like this:

>>> output = logistic_regression(training=x, labels=y, verbose=True)
>>> model = transform_to_python(output['output_model'])
>>> model
{'parameters': [0.5, 0.6, 0.1, 0.4]}

Now maybe there are better names than transform_to_python() (in fact I am sure there are---I just can't think of any right now), but the general idea is that we give it an mlpack model, and we get back some kind of hopefully-readable dict. We don't need to go in the reverse direction right now, but it would be nice to eventually be able to do that. Sometimes a readable dict will be impossible---something like a kd-tree (which is given as part of the output models for knn() and kfn(), for instance) is really unwieldy to handle in a dict. So there will exist cases where what we give back might be totally incomprehensible, but for simpler models it should be easy to understand the output.

I thought about a few approaches to this problem:

1. Serialize to XML, and then read it in. We could write something similar to SerializeOut() that uses the xml_archive boost::serialization archive. That gives us models kind of like this:

$ cat model.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE boost_serialization>
<boost_serialization signature="serialization::archive" version="16">
<model class_id="0" tracking_level="0" version="0">
        <parameters class_id="1" tracking_level="0" version="0">
                <n_rows>1</n_rows>
                <n_cols>3</n_cols>
                <n_elem>3</n_elem>
                <vec_state>2</vec_state>
                <item>-3.09842048843608168e-03</item>
                <item>-2.67613064852209731e-02</item>
                <item>1.63889087606279293e-02</item>
        </parameters>
        <lambda>0.00000000000000000e+00</lambda>
</model>
</boost_serialization>

I tried loading one of these with the xmltodict Python module, which seemed promising. But there will still be some post-processing necessary---we'll have to recognize nodes that have n_rows, n_cols, n_elem, vec_state, and an item array as Armadillo objects and load them as numpy ndarrays (a rough sketch of this appears after this list). There may be other Python modules which are also useful for this---I haven't fully surveyed the landscape.

2. Implement some kind of different serialization. I suppose it might be possible to implement a new boost::serialization archive type which is much easier to read into Python. That will involve a lot of reading boost::serialization source and tricky debugging, but it could give a nice solution in the end.

3. Investigate other serialization libraries like cereal. cereal is a serialization library that's mostly a drop-in replacement for boost::serialization. That could be another option, but it doesn't support raw pointers---which would probably make things very hard, since we do use raw pointers throughout the mlpack codebase. Changing that could be quite an intensive refactoring project and might have hidden efficiency costs we would have to address.
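To make option 1 a bit more concrete, here is a rough sketch of what the xmltodict-based post-processing could look like. The load_model_xml() helper and the exact traversal are assumptions for illustration, not an existing mlpack API; the tag names follow the example archive above:

import numpy as np
import xmltodict

# Sketch of the post-processing idea from option 1 (not an existing mlpack
# function): parse the boost::serialization XML archive and convert nodes
# that look like Armadillo objects (n_rows/n_cols/n_elem/vec_state + item)
# into numpy arrays, leaving everything else as plain values.
def _convert(node):
    if isinstance(node, dict):
        arma_keys = {'n_rows', 'n_cols', 'n_elem', 'vec_state', 'item'}
        if arma_keys.issubset(node.keys()):
            items = node['item']
            if not isinstance(items, list):
                items = [items]
            values = np.array([float(v) for v in items])
            # Armadillo stores elements column-major, hence order='F'.
            return values.reshape((int(node['n_rows']), int(node['n_cols'])),
                                  order='F')
        # Drop XML attributes (keys starting with '@') and recurse.
        return {k: _convert(v) for k, v in node.items() if not k.startswith('@')}
    return node

def load_model_xml(filename):
    with open(filename) as f:
        doc = xmltodict.parse(f.read())
    return _convert(doc['boost_serialization']['model'])

# For the model.xml above, load_model_xml('model.xml') would give roughly:
# {'parameters': array([[-0.0031, -0.0268, 0.0164]]),
#  'lambda': '0.00000000000000000e+00'}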

Maybe there are more ideas too; I am not tied to one of those three, so long as we can address the original problem in some way. We'll also have to figure out how to document this functionality and make sure Python users are aware of it.

To me this is something of a high-importance issue, since figuring out what is inside a model is a natural need for Python users.

shashankTwr commented 5 years ago

@rcurtin I would like to work on this issue.

rcurtin commented 5 years ago

@shashank-007: anyone is welcome to work on the issue but I think it would be a good idea to finish with the other issues you have been working on before approaching this one.

Nimishkhurana commented 5 years ago

@rcurtin Can I work on this issue? This will be my first issue. I have built mlpack and tried out some algorithms from the Python bindings.

MuLx10 commented 5 years ago

@rcurtin can we simply pass the parameters in the dict (code)? If yes, I would like to implement this for other models as well.

Sample output for logistic regression:

'model_parameters': array([ 8.45899258, 70.70441776, 13.2020243 , -33.56279268, -42.39608696, -18.72453884, 9.33369732, -50.51047159, 57.38275601, 8.94633985, -34.8868858 ])

rcurtin commented 5 years ago

@MuLx10 unfortunately no, that would require custom handling for every single model type. So we need something more generic than that.

MuLx10 commented 5 years ago

@rcurtin using cereal would add a dependency, so I used the XML archive and then converted it to JSON. Sample output for logistic regression:

{ "parameters":{"item":["32.340145733844459","58.073812365106726","8.2447012995294884","-74.424864896325332","-22.415463410875518","30.261501014799126","-49.719647192965773","-57.993374922244627","1.3846401252475675","-.0284714693657069","20.691535892381125"]}, "lambda":"0" }

rcurtin commented 5 years ago

@MuLx10 thanks, I'll review #1734 when I have a chance. :+1:

MuLx10 commented 5 years ago

Sure @rcurtin. Thanks.

SinghKislay commented 5 years ago

I would like to work on this problem.

mlpack-bot[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1:

NippunSharma commented 3 years ago

Hey @rcurtin, I see that there is no merged PR for this issue. Is this issue still open for work? If it is, can I work on it? I also have a doubt about one of the points you mentioned in the description. You mentioned using catch instead of boost::serialization; since this issue is old I understand why you said that, but now that we use catch, are there any changes you might want to mention in the description? Also, can you please elaborate a little on how we can use catch for the same?

rcurtin commented 3 years ago

@NippunSharma would be great to see this one worked on again. I'm not sure where I mentioned using catch... can you point me to where I said that? In any case, we have switched from boost::serialization to cereal, and cereal supports JSON, so perhaps it is straightforward to import JSON into Python or something. There are lots of possible ideas, so feel free to play around and see what might work!

NippunSharma commented 3 years ago

@rcurtin yeah you are right... I meant cereal and not catch. Sorry for the typo here.

NippunSharma commented 3 years ago

@rcurtin for which methods should users be able to see the parameters? I mean, parameters of simple methods like linear_regression and logistic_regression can be displayed as a vector or matrix, but how can the parameters of random_forest or adaboost be displayed? What I am trying to ask is: should we handle different methods individually, or should there be a standard format we can use to display parameters for all methods?

rcurtin commented 3 years ago

Ideally we should do this all automatically. I think it would be fine to use the JSON representation provided by cereal for each model, perhaps with a little bit of automated postprocessing. This may not be the "cleanest" representation for an end user, but it would be by far the easiest for us to implement and maintain. :+1:
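As a rough illustration of that idea (not actual mlpack code; the key names used below to detect Armadillo objects are assumptions borrowed from the XML example earlier in the thread, not cereal's real archive layout, so they would need adjusting once the actual JSON structure is known):

import json
import numpy as np

# Hypothetical post-processing of a cereal JSON archive: walk the parsed
# structure and turn anything that looks like an Armadillo object into a
# numpy array, leaving other values untouched.
def model_to_dict(json_string):
    def walk(node):
        if isinstance(node, dict):
            if {'n_rows', 'n_cols', 'item'}.issubset(node):
                values = np.array([float(v) for v in node['item']])
                # Armadillo stores elements column-major, hence order='F'.
                return values.reshape((int(node['n_rows']), int(node['n_cols'])),
                                      order='F')
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(value) for value in node]
        return node
    return walk(json.loads(json_string))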

NippunSharma commented 3 years ago

Hey @rcurtin, I have opened #2868. There we can discuss the necessary post-processing required on the JSON. Please take a look whenever you have time.

NippunSharma commented 3 years ago

@rcurtin, can we close this one?

rcurtin commented 3 years ago

Yes, we can! Thank you for reminding me and awesome work with #2868. :rocket: