Closed rcurtin closed 3 years ago
@rcurtin I would like to work on this issue.
@shashank-007: anyone is welcome to work on the issue but I think it would be a good idea to finish with the other issues you have been working on before approaching this one.
@rcurtin Can I work on this issue? This will be my first issue. I have built mlpack and tried out some algorithms of the Python bindings.
@rcurtin can we simply pass the parameters in a dict? If yes, I would like to implement it for the other models as well.
Sample output for logistic regression:
'model_parameters': array([ 8.45899258, 70.70441776, 13.2020243 , -33.56279268, -42.39608696, -18.72453884, 9.33369732, -50.51047159, 57.38275601, 8.94633985, -34.8868858 ])
@MuLx10 unfortunately no, that would require custom handling for every single model type. So we need something more generic than that.
@rcurtin using cereal would add a dependency, so I used the XML archive followed by converting it to JSON. Sample output for logistic regression:
```json
{
  "parameters": {
    "item": ["32.340145733844459", "58.073812365106726", "8.2447012995294884",
             "-74.424864896325332", "-22.415463410875518", "30.261501014799126",
             "-49.719647192965773", "-57.993374922244627", "1.3846401252475675",
             "-.0284714693657069", "20.691535892381125"]
  },
  "lambda": "0"
}
```
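A dict like the sample above could be turned into a usable numpy array with a few lines of postprocessing. This is only a rough sketch; the key names (`parameters`, `item`) follow the sample output above:

```python
import json
import numpy as np

# JSON in the shape of the sample above (values abbreviated for clarity).
model_json = '{"parameters": {"item": ["32.34", "58.07", "8.24"]}, "lambda": "0"}'

model = json.loads(model_json)
# The "item" list holds the coefficients as strings; convert them to floats.
params = np.array([float(v) for v in model["parameters"]["item"]])
print(params)
```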
@MuLx10 thanks, I'll review #1734 when I have a chance. :+1:
Sure @rcurtin. Thanks!
I would like to work on this problem.
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1:
Hey, @rcurtin I see that there is no merged PR for this issue. Is this issue still open for work? If it is, can I work on it?
I also have a doubt about one of the points you mentioned in the description. You mentioned using `catch` instead of `boost::serialization`; as this issue is old I understand why you said that, but now that we use `catch`, are there any changes you might want to mention in the description? Also, can you please elaborate a little on how we can use `catch` for the same?
@NippunSharma it would be great to see this one worked on again. I'm not sure where I mentioned using `catch`... can you point me to where I said that? In any case, we have switched from `boost::serialization` to `cereal`, and `cereal` supports JSON, so perhaps it is straightforward to import JSON into Python or something. There are lots of possible ideas, so feel free to play around and see what might work!
@rcurtin yeah you are right... I meant cereal and not catch. Sorry for the typo here.
@rcurtin the parameters of which methods should users be able to see? I mean, for simple methods like linear_regression and logistic_regression the parameters can be displayed as a vector or matrix, but how can the parameters of random_forest or adaboost be displayed? What I am trying to say is: should we handle different methods individually, or should there be a standard format with which we can display parameters for all methods?
Ideally we should do this all automatically. I think it would be fine to use the JSON representation provided by cereal for each model, perhaps with a little bit of automated postprocessing. This may not be the "cleanest" representation for an end user, but it would be by far the easiest for us to implement and maintain. :+1:
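The automated postprocessing mentioned above could look roughly like the sketch below: walk the JSON tree and replace any node that looks like a serialized Armadillo object with a numpy array. The key names (`n_rows`, `n_cols`, `item`) are assumptions based on the XML node names discussed elsewhere in this issue; the real cereal output may use different names and extra wrapping.

```python
import json
import numpy as np

def postprocess(node):
    """Recursively replace Armadillo-looking JSON nodes with numpy arrays.

    Assumes a node with "n_rows", "n_cols", and an "item" list is a
    serialized Armadillo matrix (an assumption; cereal's actual key
    names may differ)."""
    if isinstance(node, dict):
        if {"n_rows", "n_cols", "item"} <= set(node):
            rows, cols = int(node["n_rows"]), int(node["n_cols"])
            data = np.array([float(v) for v in node["item"]])
            # Armadillo stores matrices in column-major order.
            return data.reshape((cols, rows)).T
        return {k: postprocess(v) for k, v in node.items()}
    return node

doc = json.loads('{"parameters": {"n_rows": "2", "n_cols": "1", '
                 '"n_elem": "2", "item": ["1.0", "2.0"]}}')
print(postprocess(doc)["parameters"])
```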
Hey @rcurtin I have opened #2868. There we can discuss the necessary post processing required on the JSON. Please take a look whenever you have time.
@rcurtin, can we close this one?
Yes, we can! Thank you for reminding me and awesome work with #2868. :rocket:
This issue has to do with mlpack's Python bindings. Ideally, if you're interested in solving this, you should be familiar with how these bindings are used from Python, and have some knowledge of the automatic bindings system that generates these Python bindings. If you don't but you're still interested in this, consider looking at the Python quickstart and the automatic bindings documentation first. Also, this is an open-ended issue---I don't know the right way to solve it, and some exploration of possibilities would be needed.
One of the major drawbacks of the Python bindings is that when we build a machine learning model, it's not really inspectable by a Python user. In fact if I run the following code:
then the output logistic regression model (accessed with `output['output_model']`) is not an easy-to-understand model but instead a Python object of type `LogisticRegressionModel`, which internally just holds a pointer to the C++ memory that represents the `mlpack::regression::LogisticRegression` class. A Python user can't do anything with this except pass it to another call to `logistic_regression()` or pickle it for later use.

If you take a look at the generated `logistic_regression.pyx` file in the `src/mlpack/bindings/python/mlpack/` directory under your build directory, you can see that the model classes in Python (like `LogisticRegressionModel`) provide `__getstate__()` and `__setstate__()` methods for pickling, and these in turn call `SerializeIn()` and `SerializeOut()`. Those functions both use boost::serialization to output the C++ model in a binary format (perfect for pickling). But as you can see, that's not very useful for people who might want to know what the parameters of their model are. (And, the whole impetus for this issue was #1703, where I wanted to do exactly that.) Now, the logistic regression model is actually really simple---it just holds a vector of parameters. So it would be really nice if I could do something like this:
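(A sketch of the hoped-for result: the dict below shows what a hypothetical `transform_to_python(output['output_model'])` call might return for a logistic regression model. The key names and values are illustrative assumptions, not an existing API.)

```python
import numpy as np

# Hypothetical: a plain, inspectable dict instead of an opaque handle.
params = {
    'parameters': np.array([8.4589, 70.7044, 13.2020, -33.5628]),
    'lambda': 0.0,
}
print(params['parameters'])  # the learned coefficients, directly inspectable
```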
Now maybe there are better names than `transform_to_python()` (in fact I am sure there are---I just can't think of any right now), but the general idea is that we give it an mlpack model, and we get back some kind of hopefully-readable dict. We don't need to go in the reverse direction right now, but it would be nice to eventually be able to do that. Sometimes a readable dict will be impossible---something like a kd-tree (which is given as part of the output models for `knn()` and `kfn()`, for instance) is really unwieldy to handle in a dict. So there will exist cases where what we give back might be totally incomprehensible, but for simpler models it should be easy to understand the output.

I thought about a few approaches to this problem:
1. Serialize to XML, and then read it in. We could write something similar to `SerializeOut()` that uses the `xml_archive` boost::serialization archive. That gives us the model as an XML document with one node per serialized member. I tried loading one of these with the `xmltodict` Python module, which seemed promising. But there will still be some post-processing necessary---we'll have to recognize nodes that have `n_rows`, `n_cols`, `n_elem`, `vec_state`, and an `item` array as an Armadillo object and thus load it as a numpy `ndarray`. There may be other Python modules which are also useful for this---I haven't fully surveyed the landscape.
2. Implement some kind of different serialization. I suppose it might be possible to implement a new boost::serialization archive type which is much easier to read into Python. That will involve a lot of reading boost::serialization source and tricky debugging, but it could give a nice solution in the end.
3. Investigate other serialization libraries like cereal. cereal is a serialization library that's mostly a drop-in replacement for boost::serialization. That could be another option, but it doesn't support raw pointers---which would probably make things very hard, since we do use raw pointers throughout the mlpack codebase. Changing that could be quite an intensive refactoring project and might have hidden efficiency costs we would have to address.
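The XML-loading idea from option 1 can be sketched with the standard library alone (the issue mentions `xmltodict`, but `xml.etree.ElementTree` suffices for a toy example). The element names `n_rows`, `n_cols`, `n_elem`, `vec_state`, and `item` follow the description above; the real boost::serialization archive has extra wrapping, so this is only illustrative:

```python
import numpy as np
import xml.etree.ElementTree as ET

# A toy XML fragment in the shape described above for a serialized
# Armadillo vector.
xml_doc = """
<parameters>
  <n_rows>3</n_rows>
  <n_cols>1</n_cols>
  <n_elem>3</n_elem>
  <vec_state>1</vec_state>
  <item>1.5</item>
  <item>-2.0</item>
  <item>0.25</item>
</parameters>
"""

root = ET.fromstring(xml_doc)
# Recognize the n_rows/n_cols/n_elem/vec_state + item pattern and build
# a numpy ndarray from the item values.
vec = np.array([float(e.text) for e in root.findall("item")])
print(vec)
```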
Maybe there are more ideas too; I am not tied to one of those three, so long as we can address the original problem in some way. We'll also have to figure out how to document this functionality and make sure Python users are aware of it.
To me this is something of a high-importance issue, since it's a natural need in Python to figure out what is in the model.