quantling / pyndl

pyndl implements Naive Discriminative Learning (NDL), a learning and classification model based on the Rescorla-Wagner equations, in Python 3.
https://pyndl.readthedocs.io
MIT License

sklearn-like API #130

Open dekuenstle opened 6 years ago

dekuenstle commented 6 years ago

The API provided by all machine learning models in the scikit-learn library has become the de facto industry standard for machine learning in Python. We should provide a similar API for NDL as well, or even make it the standard.

Pseudocode

from pyndl import Ndl, Vectorizer  # proposed classes, not part of pyndl yet
# ...
# encode cue and outcome strings as binary vectors
cue_vec, outcome_vec = Vectorizer(), Vectorizer()
X = cue_vec.fit_transform(cues)
y = outcome_vec.fit_transform(outcomes)

# learn the weight matrix from the vectorized events
model = Ndl(alpha=0.01)
model.fit(X, y)
# ...
# vectorize unseen cues with the already-fitted vectorizer, predict,
# and map the predicted vectors back to outcome strings
X_test = cue_vec.transform(["#_this_is_a_test_#"])
y_pred = model.predict(X_test)
outcome_pred = outcome_vec.inverse_transform(y_pred)
# ...

This API cleanup would definitely be some work, but I think it is worth it. Most of the new classes would just wrap around the existing functions.
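To make the "just wrap the existing functions" idea concrete, here is a minimal sketch of what such an sklearn-style estimator could look like. Everything below is an assumption, not pyndl code: the class name Ndl is the one proposed above, the learning rule is a toy Rescorla-Wagner update over binary cue/outcome vectors, and the conventions (hyperparameters in __init__, learned state in attributes ending in an underscore, fit returning self) are sklearn's.

```python
class Ndl:
    """Toy Rescorla-Wagner learner with an sklearn-like interface (sketch)."""

    def __init__(self, alpha=0.01, epochs=10):
        self.alpha = alpha    # learning rate (alpha and beta collapsed into one)
        self.epochs = epochs  # passes over the event list

    def fit(self, X, y):
        """X, y: lists of binary cue / outcome vectors, one event per row."""
        n_cues, n_outcomes = len(X[0]), len(y[0])
        # weights_ follows the sklearn convention for learned attributes
        self.weights_ = [[0.0] * n_outcomes for _ in range(n_cues)]
        for _ in range(self.epochs):
            for cues, outcomes in zip(X, y):
                # activation of each outcome given the cues present in this event
                act = [sum(row[j] for row, c in zip(self.weights_, cues) if c)
                       for j in range(n_outcomes)]
                # Rescorla-Wagner update: weights of present cues move
                # towards the prediction error (outcome - activation)
                for i, c in enumerate(cues):
                    if c:
                        for j in range(n_outcomes):
                            self.weights_[i][j] += self.alpha * (outcomes[j] - act[j])
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        """Return the index of the most activated outcome for each event."""
        preds = []
        for cues in X:
            act = [sum(row[j] for row, c in zip(self.weights_, cues) if c)
                   for j in range(len(self.weights_[0]))]
            preds.append(max(range(len(act)), key=act.__getitem__))
        return preds
```

A real wrapper would of course delegate to pyndl's existing learning routines instead of re-implementing the update; the point is only the shape of the interface.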

What do you think about this?

Trybnetic commented 6 years ago

I think it is a good idea!

But would you suggest writing only a wrapper or to restructure the whole project?

I would suggest adding an api.py module in which we define the classes and functions you mentioned, and importing them into pyndl/__init__.py so they are callable as you described.
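As a layout fragment (module and class names are the ones proposed above; the file contents are assumptions, not existing code), that would look roughly like:

```python
# pyndl/api.py -- hypothetical new module holding the sklearn-style classes,
# each wrapping the existing pyndl functions
class Vectorizer:
    ...

class Ndl:
    ...

# pyndl/__init__.py -- re-export, so users can write `from pyndl import Ndl`
# from pyndl.api import Ndl, Vectorizer
```

This keeps the existing modules untouched and makes the new API purely additive.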

What do you think?

derNarr commented 6 years ago

In a way, the sklearn API never felt very pythonic to me. Creating objects and then calling methods that manipulate them in place feels more like Ruby. Still, it is the de facto standard in the Python machine learning community, and we should fit in.

I would like to work a little more with sklearn before deciding whether we want a wrapper for the sklearn API while maintaining a separate API of our own, or whether we want to support only the sklearn API and move to it completely.

@Trybnetic we should have an offline discussion about that.

The biggest conceptual change seems to be to no longer store all the information in the weights matrix and its attributes, but to have a model object that stores this information. For me it would be crucial how to serialize this model object and how to load it into R.
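One option for the serialization question (a sketch of a possible approach, not a decision, and the NdlModel name and attribute layout are assumptions): write the weight matrix and the run metadata to plain formats that R reads natively, e.g. CSV for the matrix (loadable in R via read.csv) and JSON for the metadata.

```python
import csv
import json
import os
import tempfile


class NdlModel:
    """Hypothetical model object holding weights plus run metadata."""

    def __init__(self, cues, outcomes, weights, meta=None):
        self.cues = cues          # row labels
        self.outcomes = outcomes  # column labels
        self.weights = weights    # nested list, one row per cue
        self.meta = meta or {}    # e.g. {"date": ..., "hostname": ...}

    def save(self, csv_path, meta_path):
        # CSV with a header row; R can load it with read.csv(..., row.names = 1)
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["cue"] + self.outcomes)
            for cue, row in zip(self.cues, self.weights):
                writer.writerow([cue] + row)
        with open(meta_path, "w") as f:
            json.dump(self.meta, f)

    @classmethod
    def load(cls, csv_path, meta_path):
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
        outcomes = rows[0][1:]
        cues = [r[0] for r in rows[1:]]
        weights = [[float(x) for x in r[1:]] for r in rows[1:]]
        with open(meta_path) as f:
            meta = json.load(f)
        return cls(cues, outcomes, weights, meta)


# round trip through temporary files
tmp = tempfile.mkdtemp()
model = NdlModel(["#ba", "an#"], ["banana"], [[0.4], [0.2]],
                 {"hostname": "example"})
model.save(os.path.join(tmp, "weights.csv"), os.path.join(tmp, "meta.json"))
restored = NdlModel.load(os.path.join(tmp, "weights.csv"),
                         os.path.join(tmp, "meta.json"))
```

The same two files would serve ndl and a hypothetical ndl_plus alike, since only the labels and the metadata dict differ.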

At the same time, a Python object as the model gives us more freedom to have something like ndl_plus modeled and serialized the same way as ndl, which would be an advantage.

Questions that need to be answered:

How does sklearn serialize learned models?

How is metadata like CPU usage, host name, and computing date stored in sklearn?

dekuenstle commented 6 years ago

Let me answer the questions:

How does sklearn serialize learned models?
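sklearn has no serialization format of its own: its model persistence documentation recommends Python's pickle, or joblib for estimators carrying large numpy arrays. A minimal round trip with a stand-in class (the class below is a placeholder illustrating the mechanism, not sklearn code):

```python
import pickle


class StandInEstimator:
    """Placeholder for a fitted sklearn-style estimator."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha

    def fit(self, X, y):
        # pretend these were learned from X, y; trailing underscore
        # marks fitted state, following the sklearn convention
        self.coef_ = [1.0, 2.0]
        return self


model = StandInEstimator(alpha=0.5).fit([[0, 0]], [0])
blob = pickle.dumps(model)      # hyperparameters and fitted state together
restored = pickle.loads(blob)
```

The downside for our R question: pickles are Python-only, which is an argument for also offering a plain-text export of the weights.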

How is metadata like CPU usage, host name, and computing date stored in sklearn?
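As far as I know, sklearn estimators store no such run metadata at all; only hyperparameters and fitted attributes live on the object. If pyndl wants to keep that information on a model object, one possibility (a sketch, not existing API) is a plain dict collected at fit time, which also serializes trivially:

```python
import datetime
import platform


def collect_meta():
    """Gather run metadata of the kind pyndl records for a learning run."""
    return {
        "date": datetime.datetime.now().isoformat(),
        "hostname": platform.node(),
        "python": platform.python_version(),
    }


meta = collect_meta()
```

Such a dict could be stored as an attribute of the model object and dumped alongside the weights.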

keras (a high-level API for the tensorflow / theano tensor-processing frameworks) has a more interesting approach: