sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[MNT] move large datasets out of `sktime` and compress file size #6985

Open fkiraly opened 1 month ago

fkiraly commented 1 month ago

19MB of the 28MB package size is due to the folder `datasets.data`, which contains a variety of raw data sets.

While some are conveniently small, there are obscure and rarely used data sets for time series classification that individually take up around 1MB each.

I would suggest we move these into a new package, `sktime-datasets`, which would be a soft dependency with contents equivalent to the `datasets` folder.

Data sets used frequently in tutorials, docstring examples, and historical blogposts should remain in the main package.
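
For illustration, a minimal sketch of how the loader side of such a split could look; the `sktime_datasets` module name, the `data` folder layout, and the `_dataset_path` helper are hypothetical, not existing sktime API:

```python
from pathlib import Path

# small, frequently used datasets stay bundled in the main package
_BUNDLED_DIR = Path(__file__).parent / "data"


def _dataset_path(name):
    """Return the path to a dataset file, preferring the bundled copy.

    Falls back to the optional sktime_datasets package (hypothetical
    soft dependency) for large or rarely used datasets.
    """
    local = _BUNDLED_DIR / name
    if local.exists():
        return local

    try:
        import sktime_datasets  # soft dependency, installed separately
    except ImportError as exc:
        raise ModuleNotFoundError(
            f"Dataset {name!r} is not bundled with sktime; "
            "install the optional 'sktime-datasets' package to load it."
        ) from exc

    return Path(sktime_datasets.__file__).parent / "data" / name
```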

Thoughts, @sktime/core-developers?

geetu040 commented 4 weeks ago

Instead of cutting the data, why don't we push these datasets to the places where they properly belong, i.e., Hugging Face or Kaggle, under the sktime organization? These datasets would differ from the original sources, as they have been preprocessed, cleaned, and converted to sktime format. Then, internally, sktime could download from these sources when the user tries to load a dataset for the first time. That way, we can keep the full datasets available and have zero footprint on the sktime package.
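
A rough sketch of the download-on-first-use idea; the URL, cache directory, and `_fetch_dataset` helper are illustrative placeholders, not an existing sktime API:

```python
import urllib.request
from pathlib import Path

# placeholder URL; an actual Hugging Face or Kaggle location under an
# sktime organization would go here
_BASE_URL = "https://example.org/sktime-datasets"
_CACHE_DIR = Path.home() / ".sktime" / "datasets"


def _fetch_dataset(filename):
    """Download a dataset file on first use and cache it locally."""
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = _CACHE_DIR / filename
    if not target.exists():  # hit the network only on the first load
        urllib.request.urlretrieve(f"{_BASE_URL}/{filename}", target)
    return target
```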

fkiraly commented 4 weeks ago

The problem with that solution is availability in the environment versus availability only from outside the environment.

For frequently used demo datasets, we want them to be in the environment once they have been installed as part of the package. Otherwise, we rely on downloads from an external location.

The reasons are: (a) not everyone has access to external downloads, consider locked-down environments in high-security setups such as financial institutions; and (b) downloads are always error prone, more so than loading from the installed environment, as they may be subject to URL interception attacks or server downtime.

Plus, for scientific auditability, we'd want to avoid "forking the data", i.e., strictly speaking creating a new data artefact, and instead keep reformatting and cleaning in Python code.
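
To make the "no data forking" point concrete, a toy sketch of keeping the cleaning step in code rather than shipping a re-exported artefact (the file name and cleaning steps are made up for illustration):

```python
import pandas as pd


def load_cleaned(path="raw_dataset.csv"):
    """Parse the unmodified upstream file and clean it at load time,
    so no new data artefact is created or shipped."""
    raw = pd.read_csv(path)
    # cleaning/reformatting lives in code, not in a re-exported copy of the data
    cleaned = raw.rename(columns=str.lower).dropna()
    return cleaned
```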

fkiraly commented 4 weeks ago

What you propose is, of course, a valid solution for infrequently used datasets, or extraneous datasets, e.g., those that do not appear in any docstring example or tutorial.

Though, I proposed an `sktime-datasets` package since it still allows use as part of the environment instead of relying on an external download.