predict-idlab / tsflex

Flexible time series feature extraction & processing
https://predict-idlab.github.io/tsflex/
MIT License

Question: Feature extraction on time series batch #67

Open mbignotti opened 2 years ago

mbignotti commented 2 years ago

Hello, First of all, I would like to thank you for the really nice library. I think it is much more straightforward, and at the same time more flexible, than similar libraries. I have a use case where I sometimes need to compute features in a rolling fashion, for which the window parameter of the FeatureDescriptor object is really helpful, and at other times need to compute features on whole time series batches, i.e., with the window parameter equal to the length of the entire time series. However, I'm having a few issues with the latter case. Here is an example:

import numpy as np
import pandas as pd
from tsflex.features import FeatureDescriptor, FeatureCollection

series = np.random.rand(100)
ts_index = pd.date_range(start="2022-06-09 00:00:00", periods=len(series), freq="min")
df = pd.DataFrame({"Value": series}, index=ts_index)

fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
        window=len(df),
        stride=1
    )
)

fc.calculate(data=df, return_df=True)

If I run the code above, I get the following error (personal information is hidden):

Traceback (most recent call last):
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 394, in calculate
    calculated_feature_list = [self._executor(idx) for idx in idxs]
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 394, in <listcomp>
    calculated_feature_list = [self._executor(idx) for idx in idxs]
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 208, in _executor
    stroll, function = get_stroll_func(idx)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 245, in get_stroll_function
    stroll = StridedRollingFactory.get_segmenter(**stroll_arg_dict)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling_factory.py", line 75, in get_segmenter
    return TimeIndexSampleStridedRolling(data, window, stride, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 495, in __init__
    super().__init__(series_list, window, stride, *args, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 373, in __init__
    super().__init__(data, window, stride, *args, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 147, in __init__
    if np.ptp(container.end_indexes - container.start_indexes) != 0:
  File "<__array_function__ internals>", line 180, in ptp
  File "/****/*****/*****/***/***/***/python3.8/site-packages/numpy/core/fromnumeric.py", line 2667, in ptp
    return _methods._ptp(a, axis=axis, out=out, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/numpy/core/_methods.py", line 278, in _ptp
    umr_maximum(a, axis, None, out, keepdims),
ValueError: zero-size array to reduction operation maximum which has no identity
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/***/*****/****/****/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 418, in calculate
    raise RuntimeError(
RuntimeError: Feature Extraction halted due to error while extracting one (or multiple) feature(s)! See stack trace above.

If I specify window=len(df) - 1, it works, but then, of course, the last data point is not used in the calculation.

Am I doing something wrong? Is there a way to achieve the required behaviour?

Thanks a lot!

Environment: python==3.8.13 numpy==1.22.4 pandas==1.4.2 tsflex==0.2.3.7.7

jvdd commented 2 years ago

Hi @mbignotti,

Glad to hear that you like our package! :smile:

Thank you for identifying this bug & providing a clear explanation with reproducible code!

I guess this bug relates to another bug I identified a couple of weeks ago in #62

sequences should be segmented into n segments if there are exactly n segments possible (e.g., window=2, stride=2 => 5 segments on sequence of length 10)

This bug is a base case of the bugfix in that PR (i.e., only 1 possible segment).
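For reference, the segment count implied by the rule quoted above can be sketched as follows. This is only an illustration of the arithmetic, not tsflex's internal code; the helper name `expected_n_segments` is hypothetical.

```python
def expected_n_segments(n_samples: int, window: int, stride: int) -> int:
    """Number of full windows of `window` samples that fit in `n_samples`
    when consecutive windows start `stride` samples apart.

    Hypothetical helper for illustration; not part of tsflex.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > n_samples:
        return 0  # not even one full window fits
    return (n_samples - window) // stride + 1


# Example from #62: window=2, stride=2 on a sequence of length 10
print(expected_n_segments(10, window=2, stride=2))
# Base case from this issue: the window spans the whole series
print(expected_n_segments(100, window=100, stride=1))
```

The first call gives 5 and the second gives 1, so a window equal to the series length should produce exactly one segment rather than an empty result.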

I'll look into the code and try to fix the bug + add some tests for this base case.

Cheers, Jeroen

mbignotti commented 2 years ago

Hi @jvdd, Thanks a lot for your reply! I apologize for coming back to this, but I've stumbled upon a related problem. Consider the case where I want to compute the features on the entire time series. To do so, provided the bug above is solved, I have to pass the length of the time series as the value of the window argument. Sometimes, however, I may not know the length of the time series in advance, and/or it might be variable. In those cases, I would like to avoid recreating the FeatureDescriptor (or the MultipleFeatureDescriptors) object just to modify the window parameter. It would be nice to omit the parameter, or specify something like window = -1, to tell the object "calculate the features on the entire time series passed to the .calculate method".
How difficult do you think it would be to implement this feature? Thanks again! Marco.
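The requested behaviour could be sketched roughly as below. Both the helper name `resolve_window` and the `None` / `-1` sentinel convention are assumptions made for illustration; neither is an existing tsflex API.

```python
from typing import Optional, Sequence


def resolve_window(window: Optional[int], series: Sequence[float]) -> int:
    """Return the effective window size for `series`.

    Hypothetical convention: `None` or `-1` means "use the entire series".
    """
    if window is None or window == -1:
        return len(series)
    if window <= 0:
        raise ValueError(f"Invalid window: {window}")
    return window


values = [0.1, 0.4, 0.2, 0.9]
print(resolve_window(-1, values))  # the sentinel resolves to len(values)
print(resolve_window(2, values))   # an explicit window is passed through
```

With such a convention the descriptor itself stays independent of the data, and the concrete window is only decided when data is passed in.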

jvdd commented 2 years ago

Hi @mbignotti

No problem, never hesitate to ping us! Sharing feature requests / issues is a crucial part of open-source development!

I'll discuss both your comments later today with @jonasvdd & @emield12 (& update you with our opinion / steps forward)

Cheers, Jeroen

jvdd commented 2 years ago

Had a very fruitful discussion with @jonasvdd & @emield12 (and also with @jellevhb).

Based on this discussion, we will refactor the code in #71

jvdd commented 2 years ago

Hi @mbignotti, can you confirm that everything works as expected in the latest release of tsflex (v0.3)? :)

mbignotti commented 2 years ago

Hi @jvdd, I noticed that, now, if you don't specify a window in FeatureDescriptor, you have to pass segment_start_idxs and segment_end_idxs to the calculate method. It is still not clear to me what the values of those parameters should be, in order to solve the example above. Any suggestions? Thanks!

jonasvdd commented 2 years ago

Hi @mbignotti,

You are right, it is now somewhat confusing to apply this functionality. This afternoon, I had a fruitful discussion with @jvdd, and we decided this:

  1. Given your example above (see snippet ⬇️ ):
fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
        window=len(df),  # this depends on the size of your dataframe
        stride=1
    )
)

The window parameter was set dynamically based on your dataframe size; this is non-optimal behaviour, i.e.:

  • this data-dependent window size will be added to the feature-output column name, e.g., in this use case it would be:

[image: resulting feature-output column name]

  • you cannot properly serialize the FeatureCollection as such when wanting to calculate features over varying data (sizes)
  2. With our current tsflex version (0.3.0), end-users need to write a lot of boilerplate code to enable this whole-series/batch feature calculation.

As such, we have decided to add the fc.calculate_unsegmented method to the FeatureCollection class:

# NOTE: the window and stride parameters are optional.
fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
    )
)

# uses the whole (unsegmented) series of `data` to calculate 
# the features upon
fc.calculate_unsegmented(data=df, return_df=True)

mbignotti commented 2 years ago

Using two different methods could be a little bit confusing, in my opinion. After all, as you mentioned in a previous comment, computing the feature on the entire time series is a special case of the more general one where you specify a window. Personally, I would have chosen one among the following two possibilities:

  1. If window is omitted in FeatureDescriptor, you are implicitly requesting a calculation on the entire batch:

# NOTE: window and stride parameters are omitted.
fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
    )
)

# Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)

  2. If you want to compute the features on the entire batch, you need to pass the window parameter with a special value (e.g. window=-1):

fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
        window=-1 # Signals that we want to compute on the entire batch. Stride cannot be passed or is ignored in this case.
    )
)

# Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)

The problem with having two different methods is that, in a real application (not just a notebook), you typically have many possible configurations, and you usually want to keep the complexity to a minimum.

However, I do not know the internals of tsflex. Hence, I cannot really say which option is the best one and/or how difficult it would be to implement. I can only judge the API from a user's point of view, which is of course limited.

In any case, big thanks for your work and effort! I wish I could give more concrete contributions, but unfortunately I don't have enough time :)

jonasvdd commented 2 years ago

Hi @mbignotti,

Thank you for putting so much effort into giving your end-user API perspective, really appreciated! 🤗

The main reason @jvdd and I wanted to introduce a new method is to make things more explicit (and move some special cases away from the already lengthy calculate docstring).

I am rather intrigued by the following sentence. Could you elaborate on it (maybe provide a use case), so I can understand it better?

The problem of having two different methods is that, in a real application (not just a notebook), you typically have many possible configurations, and you usually want to keep the complexity at a minimum level.


Regarding your proposed alternatives; I rather like them, and they seem rather intuitive / user-friendly. So, I will give them some thought later on!

For now, I will create a new branch on which I expose the current, non-final calculate_unsegmented implementation.

jonasvdd commented 2 years ago

Hi @mbignotti! 👋🏼

Have you by any chance found the time to look at the above issue (and mentioned PR)? Would love to hear your opinion about this before we take future concrete implementation steps! 😃

Kind regards, Jonas

mbignotti commented 2 years ago

Hi @jonasvdd, I am really sorry for the late reply. I've been really busy these days. I'll try to explain what I meant by that sentence. Suppose you are trying to build a service and/or product for one or more time series tasks (e.g. forecasting, classification, ...), and one of the steps involves some time series feature extraction. The features might vary from time to time, and changing the source code each time would be very inefficient. Hence, one common way to solve the problem is to use configuration files. Here is an example yaml file (not necessarily correct, it's just an example to give the idea):

FeatureDescriptor:
    - function: "np.mean"   # map somehow the string to the actual function
      series_name: "Value"
      window: null # or -1
    - function: "np.std"
      series_name: "Value"
      window: null # or -1

Then, the source code will look something like this:

import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# NOTE: each "function" string still has to be mapped to the actual callable
fc = FeatureCollection(
    [FeatureDescriptor(**settings) for settings in config["FeatureDescriptor"]]
)

fc.calculate(data=df, return_df=True)
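The "map somehow the string to the actual function" step from the yaml example could be sketched with a small lookup table. Everything here is hypothetical: the registry, the helper name `build_descriptor_kwargs`, and the use of standard-library functions as stand-ins for np.mean and np.std so that the sketch runs without extra dependencies.

```python
import statistics
from typing import Any, Callable, Dict, List

# Hypothetical registry: config string -> callable.
# Stdlib stand-ins are used here; a real setup would map to np.mean / np.std.
FUNCTION_REGISTRY: Dict[str, Callable[..., float]] = {
    "np.mean": statistics.fmean,
    "np.std": statistics.pstdev,  # population std, like np.std with ddof=0
}


def build_descriptor_kwargs(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Replace each "function" string in the config with its callable."""
    resolved = []
    for settings in config["FeatureDescriptor"]:
        name = settings["function"]
        if name not in FUNCTION_REGISTRY:
            raise KeyError(f"Unknown feature function: {name!r}")
        resolved.append({**settings, "function": FUNCTION_REGISTRY[name]})
    return resolved


# Same structure as the yaml example, written as a dict.
config = {
    "FeatureDescriptor": [
        {"function": "np.mean", "series_name": "Value", "window": None},
        {"function": "np.std", "series_name": "Value", "window": None},
    ]
}
kwargs_list = build_descriptor_kwargs(config)
```

Each resulting dict could then be unpacked into a descriptor, provided the library accepts the chosen window convention.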

Having two different calculate methods would imply adding some if/else branching, depending on the values provided in the configuration file. For example:

if all(settings["window"] is None for settings in config["FeatureDescriptor"]):  # or settings["window"] == -1
    fc.calculate_unsegmented(data=df, return_df=True)
else:
    fc.calculate(data=df, return_df=True)

Maybe, in this case, it's not a big problem, but in my opinion it's cleaner to define everything about how to perform the calculation in the FeatureDescriptor, and then call a single method to do the actual calculation. This would simplify the mapping from the config file to the Python object. I hope this helps. Thanks again! Marco.
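If two methods were kept, the branching could at least be isolated in one place. The wrapper below is a minimal sketch under several assumptions: `run_feature_extraction` is a hypothetical name, `calculate_unsegmented` is the method proposed earlier in this thread and is assumed to accept the same `data` and `return_df` arguments as `calculate`, and `None` / `-1` are the assumed "whole series" markers.

```python
from typing import Any, Dict


def run_feature_extraction(fc: Any, config: Dict[str, Any], data: Any) -> Any:
    """Pick the calculation method based on the configured windows.

    Hypothetical wrapper: the whole-series method is used only when every
    descriptor omits `window` (None) or uses the assumed -1 sentinel.
    """
    windows = [s.get("window") for s in config["FeatureDescriptor"]]
    if all(w is None or w == -1 for w in windows):
        # Assumed signature, mirroring `calculate`
        return fc.calculate_unsegmented(data=data, return_df=True)
    return fc.calculate(data=data, return_df=True)
```

Note that a config mixing windowed and whole-series descriptors falls through to `calculate` in this sketch, which is one more reason a single method driven by the FeatureDescriptor would be simpler for the caller.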

jvdd commented 1 year ago

Related issue #63