[BUG] `ShapeletTransform` sporadically returns nested `numpy` arrays as `shapelets_` fitted parameter

fkiraly commented 7 months ago

We have started to interface pyts at sktime due to popular demand.

When running the test framework, we've detected inconsistencies in the fitted parameters of ShapeletTransform, namely shapelets_ sporadically being nested numpy on our test cases.

We haven't narrowed it down to "sktime-less" code, but the issue is probably in pyts because the fitted parameters are mirrored 1:1.

Minimal reproducing code (which could be reduced further when removing the pyts adapter):

from sktime.utils import check_estimator
from sktime.transformations.panel.shapelet_transform import ShapeletTransformPyts

check_estimator(ShapeletTransformPyts, tests_to_run="test_non_state_changing_method_contract", raise_exceptions=True)

The key observation is that the test instance of ShapeletTransformPyts, which contains a ShapeletTransform instance, has a shapelets_ attribute of unexpected type (nested numpy).

I have not debugged this, but I suspect this is due to ambiguous array coercion, e.g., assigning an array to a scalar somewhere, and when an array shape lengths ends up 1, the coercion behaviour breaks.

sktime issue: https://github.com/sktime/sktime/issues/6171

johannfaouzi commented 7 months ago

The shapelets_ attribute is indeed a nested numpy.ndarray because the lengths of the extracted shapelets is not fixed (several window sizes are considered). So it's an array of arrays, but it could have been a list of arrays.

I had a look at the documentation of ShapeletTransformClassifier, and I didn't find a public attribute for the extracted shapelets. I guess that the shapelets have to saved someway (or one can just save the indices to avoid making copies) to make predictions on unseen data. I know that the data is saved in another format in sktime (pandas.DataFrame), which may be more suited to handle variable-length data.

One important current limitation of pyts is that it does not support variable-length time series (except for DTW metrics and kNN with such metrics).

What kind of changes (either in pyts or in sktime) would make this estimator more compatible with sktime API?

fkiraly commented 7 months ago

Thanks for the clarification. I'll try to answer to different points above.

the "shapelets" attribute

The shapelets_ attribute issue is not an sktime compatibility issue, as it is not violating contract assumptions.

The attribute s of ShapeletTransform, it is named shapelets_ (not ShapeletTransformClassifier).

Attributes ending in (as opposed to, starting with) underscore are considered public in the scikit-learn interface, as they store fitted parameters.

I see, so you are saying nested array (so, array with arrays inside) is intended. Then I'm surprised that not all test scenarios fail, only some. In this case, the bug would be that in some cases the attribute ends up as a nested array, in some not?

`sktime` compatibility

What kind of changes (either in pyts or in sktime) would make this estimator more compatible with sktime API?

There are no major changes necessary, pyts has a consistent interface that can easily be adapted in sktime, and it is internally consistent. Let me know if I understand sth wrong below:

pyts is not entirely sktime compliant, but it is almost sklearn compliant, because you inherit from sklearn BaseEstimator and represent your time series to be 2D numpy which is an sklearn container.

the main gap to sklearn compliance imo is systematic checking via sklearn check_estimator (this should be part of the tests). This is compliance with the strict interface assumptions for tabular (not time series) classification etc that sklearn lays out. This way of compliance is possible as long as your time series are always univariate, equal length.
the gap to sktime compliance is that sktime follows a different extension pattern (manages its own boilerplate, more flexible in the data representations), which pyts is not following. The reliance on the sklearn interface, though, prevents things like unequal length time series etc. The sktime interface and the sklearn interface are composition compaible, but not identical (because they are meant ot support different learning tasks).

Happy to have a larger discussion thread about this, and potential actions, if you are interested. I'd need to understand goals, of course.

fkiraly commented 6 months ago

As discussed above: not a problem in pyts but in sktime checks. Fixed there.

johannfaouzi / pyts