mlpack / mlpack

mlpack: a fast, header-only C++ machine learning library
https://www.mlpack.org/

Make models more accessible in Python #1709

Closed rcurtin closed 3 years ago

rcurtin commented 5 years ago

This issue has to do with mlpack's Python bindings. Ideally, if you're interested in solving this, you should be familiar with how these bindings are used from Python, and have some knowledge of the automatic bindings system that generates these Python bindings. If you don't but you're still interested in this, consider looking at the Python quickstart and the automatic bindings documentation first. Also, this is an open-ended issue---I don't know the right way to solve it, and some exploration of possibilities would be needed.

One of the major drawbacks of the Python bindings is that when we build a machine learning model, it's not really inspectable by a Python user. In fact if I run the following code:

>>> import numpy as np
>>> from mlpack import logistic_regression
>>> x = np.random.rand(10, 10)
>>> y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
>>> output = logistic_regression(training=x, labels=y, verbose=True)

then the output logistic regression model (accessed with output['output_model']) is not an easy-to-understand model but instead a Python object of type LogisticRegressionModel, which internally just holds a pointer to the C++ memory that represents the mlpack::regression::LogisticRegression class. A Python user can't do anything with this except pass it to another call to logistic_regression() or pickle it for later use:

>>> import pickle
>>> pickle.dumps(output['output_model'])

If you take a look at the generated logistic_regression.pyx file in the src/mlpack/bindings/python/mlpack/ directory under your build directory, you can see that the model classes in Python (like LogisticRegressionModel) provide __getstate__() and __setstate__() methods for pickling, and these in turn call SerializeIn() and SerializeOut(). Those functions use boost::serialization to read and write the C++ model in a binary format (perfect for pickling).
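For illustration, here is a highly simplified sketch of that pattern in plain Python; the class and method names mirror the description above, but the stand-in byte string and helper attribute are assumptions for the example, not the actual generated Cython code (which wraps a pointer to C++ memory and calls SerializeOut()/SerializeIn()):

import pickle

# Simplified, hypothetical illustration of the pickling pattern described
# above; the real generated .pyx code serializes the wrapped C++ model via
# SerializeOut()/SerializeIn() instead of the stand-in bytes used here.
class LogisticRegressionModel(object):
    def __init__(self):
        self._cpp_state = b''  # stand-in for the serialized C++ model

    def __getstate__(self):
        # In the generated code, SerializeOut() produces a binary
        # boost::serialization archive of the underlying C++ model.
        return {'serialized': self._cpp_state}

    def __setstate__(self, state):
        # In the generated code, SerializeIn() rebuilds the C++ model
        # from that binary archive.
        self._cpp_state = state['serialized']

model = LogisticRegressionModel()
model._cpp_state = b'\x00\x01\x02'  # pretend this came from SerializeOut()
restored = pickle.loads(pickle.dumps(model))
assert restored._cpp_state == model._cpp_state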

But as you can see, that's not very useful for people who might want to know what the parameters of their model are. (And the whole impetus for this issue was #1703, where I wanted to do exactly that.) Now, the logistic regression model is actually really simple---it just holds a vector of parameters. So it would be really nice if I could do something like this:

>>> output = logistic_regression(training=x, labels=y, verbose=True)
>>> model = transform_to_python(output['output_model'])
>>> model
{'parameters': [0.5, 0.6, 0.1, 0.4]}

Now maybe there are better names than transform_to_python() (in fact I am sure there are---I just can't think of any right now), but the general idea is that we give it an mlpack model, and we get back some kind of hopefully-readable dict. We don't need to go in the reverse direction right now, but it would be nice to eventually be able to do that. Sometimes a readable dict will be impossible---something like a kd-tree (which is given as part of the output models for knn() and kfn(), for instance) is really unwieldy to handle in a dict. So there will exist cases where what we give back might be totally incomprehensible, but for simpler models it should be easy to understand the output.

I thought about a few approaches to this problem:

1. Serialize to XML, and then read it in. We could write something similar to SerializeOut() that uses the xml_archive boost::serialization archive. That gives us models kind of like this:

$ cat model.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE boost_serialization>
<boost_serialization signature="serialization::archive" version="16">
<model class_id="0" tracking_level="0" version="0">
        <parameters class_id="1" tracking_level="0" version="0">
                <n_rows>1</n_rows>
                <n_cols>3</n_cols>
                <n_elem>3</n_elem>
                <vec_state>2</vec_state>
                <item>-3.09842048843608168e-03</item>
                <item>-2.67613064852209731e-02</item>
                <item>1.63889087606279293e-02</item>
        </parameters>
        <lambda>0.00000000000000000e+00</lambda>
</model>
</boost_serialization>

I tried loading one of these with the xmltodict Python module, which seemed promising. But there will still be some post-processing necessary---we'll have to recognize nodes that have n_rows, n_cols, n_elem, vec_state, and an item array as Armadillo objects and load them as numpy ndarrays (a rough sketch of this appears after this list). There may be other Python modules which are also useful for this---I haven't fully surveyed the landscape.

2. Implement some kind of different serialization. I suppose it might be possible to implement a new boost::serialization archive type which is much easier to read into Python. That will involve a lot of reading boost::serialization source and tricky debugging, but it could give a nice solution in the end.

3. Investigate other serialization libraries like cereal. cereal is a serialization library that's mostly a drop-in replacement for boost::serialization. That could be another option, but it doesn't support raw pointers---which would probably make things very hard, since we do use raw pointers throughout the mlpack codebase. Changing that could be quite an intensive refactoring project and might have hidden efficiency costs we would have to address.
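To make option 1 a bit more concrete, here is a rough sketch of what the xmltodict-based post-processing could look like. The load_model_xml() helper and the exact traversal are assumptions for illustration, not an existing mlpack API; the tag names follow the example archive above:

import numpy as np
import xmltodict

# Sketch of the post-processing idea from option 1 (not an existing mlpack
# function): parse the boost::serialization XML archive and convert nodes
# that look like Armadillo objects (n_rows/n_cols/n_elem/vec_state + item)
# into numpy arrays, leaving everything else as plain values.
def _convert(node):
    if isinstance(node, dict):
        arma_keys = {'n_rows', 'n_cols', 'n_elem', 'vec_state', 'item'}
        if arma_keys.issubset(node.keys()):
            items = node['item']
            if not isinstance(items, list):
                items = [items]
            values = np.array([float(v) for v in items])
            # Armadillo stores elements column-major, hence order='F'.
            return values.reshape((int(node['n_rows']), int(node['n_cols'])),
                                  order='F')
        # Drop XML attributes (keys starting with '@') and recurse.
        return {k: _convert(v) for k, v in node.items() if not k.startswith('@')}
    return node

def load_model_xml(filename):
    with open(filename) as f:
        doc = xmltodict.parse(f.read())
    return _convert(doc['boost_serialization']['model'])

# For the model.xml above, load_model_xml('model.xml') would give roughly:
# {'parameters': array([[-0.0031, -0.0268, 0.0164]]),
#  'lambda': '0.00000000000000000e+00'}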

Maybe there are more ideas too; I am not tied to one of those three, so long as we can address the original problem in some way. We'll also have to figure out how to document this functionality and make sure Python users are aware of it.

To me this is something of a high-importance issue, since figuring out what is inside a model is a natural need for Python users.

shashankTwr commented 5 years ago

@rcurtin I would like to work on this issue.

rcurtin commented 5 years ago

@shashank-007: anyone is welcome to work on the issue but I think it would be a good idea to finish with the other issues you have been working on before approaching this one.

Nimishkhurana commented 5 years ago

@rcurtin Can I work on this issue? This will be my first issue. I have built mlpack and tried out some algorithms from the Python bindings.

MuLx10 commented 5 years ago

@rcurtin can we simply pass the parameters in the dict (code)? If yes, I would like to implement this for other models as well.

Sample output for logistic regression:

'model_parameters': array([ 8.45899258, 70.70441776, 13.2020243 , -33.56279268, -42.39608696, -18.72453884, 9.33369732, -50.51047159, 57.38275601, 8.94633985, -34.8868858 ])

rcurtin commented 5 years ago

@MuLx10 unfortunately no, that would require custom handling for every single model type. So we need something more generic than that.

MuLx10 commented 5 years ago

@rcurtin using cereal would add a dependency, so I used the XML archive and then converted it to JSON. Sample output for logistic regression:

{ "parameters":{"item":["32.340145733844459","58.073812365106726","8.2447012995294884","-74.424864896325332","-22.415463410875518","30.261501014799126","-49.719647192965773","-57.993374922244627","1.3846401252475675","-.0284714693657069","20.691535892381125"]}, "lambda":"0" }

rcurtin commented 5 years ago

@MuLx10 thanks, I'll review #1734 when I have a chance. :+1:

MuLx10 commented 5 years ago

Sure @rcurtin. Thanks.

SinghKislay commented 5 years ago

I would like to work on this problem.

mlpack-bot[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1:

NippunSharma commented 3 years ago

Hey @rcurtin, I see that there is no merged PR for this issue. Is this issue still open for work? If it is, can I work on it? I also have a doubt about one of the points you mentioned in the description. You mentioned using catch instead of boost::serialization; since this issue is old I understand why you said that, but now that we use catch, are there any changes you might want to mention in the description? Also, can you please elaborate a little on how we can use catch for the same?

rcurtin commented 3 years ago

@NippunSharma would be great to see this one worked on again. I'm not sure where I mentioned using catch... can you point me to where I said that? In any case, we have switched from boost::serialization to cereal, and cereal supports JSON, so perhaps it is straightforward to import JSON into Python or something. There are lots of possible ideas, so feel free to play around and see what might work!

NippunSharma commented 3 years ago

@rcurtin yeah you are right... I meant cereal and not catch. Sorry for the typo here.

NippunSharma commented 3 years ago

@rcurtin for which methods should users be able to see the parameters? I mean, parameters of simple methods like linear_regression and logistic_regression can be displayed as a vector or matrix, but how can the parameters of random_forest or adaboost be displayed? What I am trying to ask is: should we handle different methods individually, or should there be a standard format we can use to display parameters for all methods?

rcurtin commented 3 years ago

Ideally we should do this all automatically. I think it would be fine to use the JSON representation provided by cereal for each model, perhaps with a little bit of automated postprocessing. This may not be the "cleanest" representation for an end user, but it would be by far the easiest for us to implement and maintain. :+1:
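As a rough illustration of that idea (not actual mlpack code; the key names used below to detect Armadillo objects are assumptions borrowed from the XML example earlier in the thread, not cereal's real archive layout, so they would need adjusting once the actual JSON structure is known):

import json
import numpy as np

# Hypothetical post-processing of a cereal JSON archive: walk the parsed
# structure and turn anything that looks like an Armadillo object into a
# numpy array, leaving other values untouched.
def model_to_dict(json_string):
    def walk(node):
        if isinstance(node, dict):
            if {'n_rows', 'n_cols', 'item'}.issubset(node):
                values = np.array([float(v) for v in node['item']])
                # Armadillo stores elements column-major, hence order='F'.
                return values.reshape((int(node['n_rows']), int(node['n_cols'])),
                                      order='F')
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(value) for value in node]
        return node
    return walk(json.loads(json_string))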

NippunSharma commented 3 years ago

Hey @rcurtin, I have opened #2868. There we can discuss the necessary post-processing required on the JSON. Please take a look whenever you have time.

NippunSharma commented 3 years ago

@rcurtin, can we close this one?

rcurtin commented 3 years ago

Yes, we can! Thank you for reminding me and awesome work with #2868. :rocket: