johannfaouzi / pyts

A Python package for time series classification
https://pyts.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.77k stars 164 forks source link

[BUG] `ShapeletTransform` sporadically returns nested `numpy` arrays as `shapelets_` fitted parameter #158

Closed fkiraly closed 6 months ago

fkiraly commented 7 months ago

We have started to interface pyts at sktime due to popular demand.

When running the test framework, we've detected inconsistencies in the fitted parameters of ShapeletTransform, namely shapelets_ sporadically being nested numpy on our test cases.

We haven't narrowed it down to "sktime-less" code, but the issue is probably in pyts because the fitted parameters are mirrored 1:1.

Minimal reproducing code (which could be reduced further when removing the pyts adapter):

from sktime.utils import check_estimator
from sktime.transformations.panel.shapelet_transform import ShapeletTransformPyts

check_estimator(ShapeletTransformPyts, tests_to_run="test_non_state_changing_method_contract", raise_exceptions=True)

The key observation is that the test instance of ShapeletTransformPyts, which contains a ShapeletTransform instance, has a shapelets_ attribute of unexpected type (nested numpy).

I have not debugged this, but I suspect this is due to ambiguous array coercion, e.g., assigning an array to a scalar somewhere, and when an array shape lengths ends up 1, the coercion behaviour breaks.

sktime issue: https://github.com/sktime/sktime/issues/6171

johannfaouzi commented 7 months ago

The shapelets_ attribute is indeed a nested numpy.ndarray because the lengths of the extracted shapelets is not fixed (several window sizes are considered). So it's an array of arrays, but it could have been a list of arrays.

I had a look at the documentation of ShapeletTransformClassifier, and I didn't find a public attribute for the extracted shapelets. I guess that the shapelets have to saved someway (or one can just save the indices to avoid making copies) to make predictions on unseen data. I know that the data is saved in another format in sktime (pandas.DataFrame), which may be more suited to handle variable-length data.

One important current limitation of pyts is that it does not support variable-length time series (except for DTW metrics and kNN with such metrics).

What kind of changes (either in pyts or in sktime) would make this estimator more compatible with sktime API?

fkiraly commented 7 months ago

Thanks for the clarification. I'll try to answer to different points above.

the "shapelets" attribute

The shapelets_ attribute issue is not an sktime compatibility issue, as it is not violating contract assumptions.

The attribute s of ShapeletTransform, it is named shapelets_ (not ShapeletTransformClassifier).

Attributes ending in (as opposed to, starting with) underscore are considered public in the scikit-learn interface, as they store fitted parameters.

I see, so you are saying nested array (so, array with arrays inside) is intended. Then I'm surprised that not all test scenarios fail, only some. In this case, the bug would be that in some cases the attribute ends up as a nested array, in some not?

sktime compatibility

What kind of changes (either in pyts or in sktime) would make this estimator more compatible with sktime API?

There are no major changes necessary, pyts has a consistent interface that can easily be adapted in sktime, and it is internally consistent. Let me know if I understand sth wrong below:

pyts is not entirely sktime compliant, but it is almost sklearn compliant, because you inherit from sklearn BaseEstimator and represent your time series to be 2D numpy which is an sklearn container.

Happy to have a larger discussion thread about this, and potential actions, if you are interested. I'd need to understand goals, of course.

fkiraly commented 6 months ago

As discussed above: not a problem in pyts but in sktime checks. Fixed there.