mlgig / mrsqm

GNU General Public License v3.0
28 stars 8 forks source link

`MrSQMClassifier` non-compliance with `sktime` serialization interface #7

Open fkiraly opened 1 year ago

fkiraly commented 1 year ago

@lnthach, @heerme, I've tried integrating MrSQM to distribute it with sktime (by interface to this package), also as a proof-of-concept of how we could back-integrate MrSEQL. Issue for MrSQM: https://github.com/sktime/sktime/issues/4338 Issue for MrSEQL: https://github.com/sktime/sktime/issues/4296

The PR https://github.com/sktime/sktime/pull/4337 works in that it proves that we can integrate cython based estimators with sktime tests and the up-to-date framework, if those estimators are in an external package such as mrsql (this one).

Now, with the tests working, it would appear that MrSQM is failing two contract tests:

I assume the same tests would fail if you run check_estimator, i.e.,

from sktime.utils.estimator_checks import check_estimator

check_estimator(MrSQMClassifier, raise_exceptions=True)

How do we proceed from here?

Problem 1: serialization

I'm not too familiar with cython, the exception when serializing (saving to binary) is

TypeError: no default __reduce__ due to non-trivial __cinit__

To make it work, I suspect one would have to implement serialization via the higher-level interface (save and load) or via the lower-level interface, __reduce__. @achieveordie has done that previously for classes of estimators, maybe he has more insight.

To provoke the error without the test suite, do pickle(my_estimator) on a fitted instance of MrSQMClassifier (that should work without complaining for an easy solution via the lower-level interface).

Problem 2: fitted parameters

The other issue is get_fitted_params, this should be implemented either directly or via inheritance from BaseClassifier - that can be solved on the sktime side though.

lnthach commented 1 year ago

Thanks @fkiraly,

Re problem 1, I was actually able to pickle MrSQMClassifier, however, only if SFA is not used. For example, this works for me.

clf = MrSQMClassifier(nsax=1,nsfa=0).fit(X_train, y_train)
pickle.dumps(clf)

But this provoked the same TypeError

clf = MrSQMClassifier(nsax=0,nsfa=1).fit(X_train, y_train)
pickle.dumps(clf)

There is not much difference between the SAX and SFA Cython wrappers code so I suspect the problem is in the C++ code. A quick and easy solution would be replacing the SFA C++ implementation with the new Python implementation (https://github.com/patrickzib/dictionary). This will slow down MrSQM significantly though.

fkiraly commented 1 year ago

Thanks for identifying the root cause.

I patched over the two issues in https://github.com/sktime/sktime/pull/4337 now, as a consequence you cannot retrieve fitted parameters or use nsfa greater than 0 with the full contract.

It can be released that way and is scheduled for 0.17.0.

We should leave this open until this is fixed in this package, and the two patches are removed in sktime.

Patch revert needs to be applied in three places in sktime (the MrSQM classifier):

fkiraly commented 1 year ago

May I hand to you the reponsibility to maintain the sktime interface from 0.17.0 on?

(e.g., once you fix the interface non-compliance in mrsqm and have that released, revert the patches too)

fkiraly commented 1 year ago

A quick and easy solution would be replacing the SFA C++ implementation with the new Python implementation (https://github.com/patrickzib/dictionary). This will slow down MrSQM significantly though.

I think speed is still sth you'd want to maintain. I think you just need to produce a suitable __reduce__ method, that might just be a few lines. Not sure how to do that in C code though.

fkiraly commented 1 year ago

may I kindly ask if there are plans (if yes timeline) on resolving this? Need any help?

fkiraly commented 1 year ago

?

lnthach commented 1 year ago

Hi @fkiraly ,

I just revisited this issue and found a solution. The problem is that MrSQMClassifier stores Cython objects (PySFA) that contain pointers. I made several attempts with the reduce function as well as getstate and setstate but couldn't make it work. The solution I found is that to only store the necessary information in a Python array and initiate the objects on the fly. This should work with pickle now. I also updated the Pypi package. On a different note, I also have put MrSEQL on Pypi (https://pypi.org/project/mrseql/).

fkiraly commented 1 year ago

Great!

Apologies for the late reply.

The solution I found is that to only store the necessary information in a Python array and initiate the objects on the fly. This should work with pickle now. I

Ok, that should work - I'll try to remove the patch, lower bound the required mrsqm version, and see what the tests say.

Reference issue: https://github.com/sktime/sktime/issues/4396 Downstream PR that enables tests: https://github.com/sktime/sktime/pull/5171

On a different note, I also have put MrSEQL on Pypi (https://pypi.org/project/mrseql/).

Excellent!

The sktime issue is here: https://github.com/sktime/sktime/issues/4296

How do we proceed with this? I think there are still users who'd like this back.

It should be easy to whip up an interface to make it searchable in sktime, just copy-pasting mrsqm. Would you like to do that, or would you prefer someone else does it?

Update: naive replication of MrSQM recipe here https://github.com/sktime/sktime/pull/5178

fkiraly commented 1 year ago

@lnthach, quick question: have you addressed both problem 1 (pickle) and 2 (parameters), or only problem 1? By intention, I mean.

fkiraly commented 1 year ago

Looks like the pickling bug is fixed!

This unlocks further tests, which seem to indicate there is an issue with the random_state contract, I've opened a new issue: https://github.com/mlgig/mrsqm/issues/11

lnthach commented 1 year ago

@lnthach, quick question: have you addressed both problem 1 (pickle) and 2 (parameters), or only problem 1? By intention, I mean

I only addressed problem 1. Re problem 2 sorry if it's a silly question, what need to be done to fix this problem ?

fkiraly commented 1 year ago

Re problem 2 sorry if it's a silly question, what need to be done to fix this problem ?

Your estimator has to implement a method get_fitted_params which returns a str keyed dict of parameters once the estimator is fitted. The return can be the empty dict if there are no fitted parameters to show to the user.

The "quickest" fix would be to add this method to the class:

def get_fitted_params(self):
    return {}

though of course you may like to return fitted parameters to show to the user if there are any of interest.