Open mbignotti opened 2 years ago
Hi @mbignotti,
Glad to hear that you like our package! :smile:
Thank you for identifying this bug & providing a clear explanation with reproducible code!
I guess this bug relates to another bug I identified a couple of weeks ago in #62
sequences should be segmented into n segments if there are exactly n segments possible (e.g., window=2, stride=2 => 5 segments on sequence of length 10)
This bug is a base case of the bugfix in that PR (i.e., only 1 possible segment).
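The segment arithmetic in the quoted example can be checked with a small sketch. This is plain sliding-window arithmetic on integer sample positions, not tsflex's internal implementation; `count_segments` and `segment_bounds` are hypothetical helper names used only for illustration.

```python
from typing import List, Tuple


def count_segments(length: int, window: int, stride: int) -> int:
    """Number of full windows that fit in a sequence of `length` samples."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if length < window:
        return 0
    return (length - window) // stride + 1


def segment_bounds(length: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) sample ranges of every full window."""
    return [(start, start + window) for start in range(0, length - window + 1, stride)]


# window=2, stride=2 on a sequence of length 10 gives 5 segments.
print(count_segments(10, 2, 2))
# Base case from this issue: the window spans the whole sequence, so exactly 1 segment.
print(segment_bounds(10, 10, 1))
```

Under these assumptions, the base case (window equal to the sequence length) should yield exactly one segment covering every sample.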
I'll look into the code and try to fix the bug + add some tests for this base case.
Cheers, Jeroen
Hi @jvdd,
Thanks a lot for your reply!
I apologize for coming back to this, but I've stumbled into a related problem.
Consider the case where I want to compute the features on the entire time series. To do so, provided the bug above is solved, I have to pass the length of the time series as the value of the `window` argument.
Sometimes, however, I may not know in advance the length of the time series and/or it might be variable.
In those cases, I would like to avoid recreating the `FeatureDescriptor` (or the `MultipleFeatureDescriptors`) object just to modify the `window` parameter. It would be nice to omit the parameter, or specify something like `window = -1`, to tell the object "calculate the features on the entire time series passed to the `.calculate` method".
How difficult do you think it would be to implement this feature?
Thanks again!
Marco.
Hi @mbignotti
No problem, never hesitate to ping us! Sharing feature requests / issues is a crucial part of open-source development!
I'll discuss both your comments later today with @jonasvdd & @emield12 (& update you with our opinion / steps forward)
Cheers, Jeroen
Had a very fruitful discussion with @jonasvdd & @emield12 (and also with @jellevhb).
Based on this discussion, we will refactor the code in #71
Hi @mbignotti, can you confirm that everything works as expected in the latest release of tsflex (v0.3)? :)
Hi @jvdd,
I noticed that, now, if you don't specify a `window` in `FeatureDescriptor`, you have to pass `segment_start_idxs` and `segment_end_idxs` to the `calculate` method.
It is still not clear to me what the values of those parameters should be, in order to solve the example above.
Any suggestions?
Thanks!
Hi @mbignotti,
You are right, it is now somewhat confusing to apply this functionality. This afternoon, I had a fruitful discussion with @jvdd, and we decided this:
    fc = FeatureCollection(
        FeatureDescriptor(
            function=np.mean,
            series_name="Value",
            window=len(df),  # this depends on the size of your dataframe
            stride=1,
        )
    )
The `window` parameter was set dynamically based on your dataframe size; this is non-optimal behaviour, i.e.:
- this data-dependent window size will be added to the feature-output column name, e.g., in this use case it would be:
- you cannot properly serialize the `FeatureCollection` as such when wanting to calculate features over varying data (sizes)
tsflex version (0.3.0) to enable this whole-series/batch feature calculation.
As such, we have decided to add the `fc.calculate_unsegmented` method to the `FeatureCollection` class:
    # NOTE: the window and stride parameters are optional.
    fc = FeatureCollection(
        FeatureDescriptor(
            function=np.mean,
            series_name="Value",
        )
    )
    # uses the whole (unsegmented) series of `data` to calculate
    # the features upon
    fc.calculate_unsegmented(data=df, return_df=True)
Using two different methods could be a little bit confusing, in my opinion. After all, as you mentioned in a previous comment, computing the feature on the entire time series is a special case of the more general one where you specify a window. Personally, I would have chosen one among the following two possibilities:
1. If `window` is omitted in `FeatureDescriptor`, you are implicitly requesting a calculation on the entire batch:
    # NOTE: window and stride parameters are omitted.
    fc = FeatureCollection(
        FeatureDescriptor(
            function=np.mean,
            series_name="Value",
        )
    )
    # Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
    fc.calculate(data=df, return_df=True)
2. If you want to compute the features on the entire batch, you need to pass the `window` parameter with a special value (e.g. `window=-1`):
```python
fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
        window=-1,  # Signals that we want to compute on the entire batch. Stride cannot be passed or is ignored in this case.
    )
)
# Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)
```
The problem with having two different methods is that, in a real application (not just a notebook), you typically have many possible configurations, and you usually want to keep the complexity to a minimum.
However, I do not know the internals of tsflex. Hence I cannot really say which option is the best one and / or how difficult it would be to implement. I can only judge the API from a user's point of view, which is of course limited.
In any case, big thanks for your work and effort! I wish I could give more concrete contributions, but unfortunately I don't have enough time :)
Hi @mbignotti,
Thank you for putting so much effort into giving your end-user API perspective, really appreciated! 🤗
The main reason @jvdd and I wanted to introduce a new method is to make things more explicit (and move some special cases away from the already lengthy `calculate` docstring).
I am rather intrigued by this sentence; could you elaborate on it (maybe with a use case), so I can understand it better?
"The problem of having two different methods is that, in a real application (not just a notebook), you typically have many possible configurations, and you usually want to keep the complexity at a minimum level."
Regarding your proposed alternatives; I rather like them, and they seem rather intuitive / user-friendly. So, I will give them some thought later on!
As for now, I will create a new branch on which I expose the current, non-final, `calculate_unsegmented` implementation.
Hi @mbignotti! 👋🏼
Have you by any chance found the time to look at the above issue (and mentioned PR)? Would love to hear your opinion about this before we take future concrete implementation steps! 😃
Kind regards, Jonas
Hi @jonasvdd, I am really sorry for the late reply. I've been really busy these days. I'll try to explain what I meant with that sentence. Suppose you are trying to build a service and/or product for one or more time series tasks (e.g. forecasting, classification,...), and one of the steps involves some ts feature extraction. The features might vary from time to time, and changing the source code each time would be very inefficient. Hence, one common way to solve the problem is by using configuration files. Here is an example yaml file (not necessarily correct, it's just an example to give the idea):
    FeatureDescriptor:
      - function: "np.mean"  # map somehow the string to the actual function
        series_name: "Value"
        window: null  # or -1
      - function: "np.std"
        series_name: "Value"
        window: null  # or -1
Then, the source code will look something like this:
    with open("config.yaml", "r") as f:
        config = yaml.safe_load(f)
    fc = FeatureCollection(
        FeatureDescriptor(
            **settings
        )
        for settings in config["FeatureDescriptor"]
    )
    fc.calculate(data=df, return_df=True)
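The "map somehow the string to the actual function" step in the YAML example could look roughly like the sketch below. It is only one possible approach: `MODULE_ALIASES`, `resolve_function`, and `build_descriptor_kwargs` are hypothetical names that are not part of tsflex, and the resolver imports whatever module the config names, so it should only be used with trusted configuration files.

```python
import importlib
from typing import Any, Callable, Dict, Mapping

# Hypothetical alias table: short prefixes used in the config -> importable module names.
MODULE_ALIASES: Dict[str, str] = {"np": "numpy"}


def resolve_function(dotted_name: str) -> Callable[..., Any]:
    """Turn a string such as "np.mean" into the callable it names."""
    module_name, _, attr = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected '<module>.<function>', got {dotted_name!r}")
    module = importlib.import_module(MODULE_ALIASES.get(module_name, module_name))
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{dotted_name!r} does not name a callable")
    return func


def build_descriptor_kwargs(entry: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy one config entry and replace its "function" string by the callable."""
    kwargs = dict(entry)
    kwargs["function"] = resolve_function(kwargs["function"])
    return kwargs
```

With such a helper, each entry under `FeatureDescriptor` in the config could be passed through `build_descriptor_kwargs` before being unpacked with `**`.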
Having two different `calculate` methods would imply adding some `if/else` branching, depending on the values provided in the configuration file. For example:
    # config["FeatureDescriptor"] is a list of settings, so check every entry
    if all(s["window"] is None for s in config["FeatureDescriptor"]):  # or s["window"] == -1
        fc.calculate_unsegmented(data=df, return_df=True)
    else:
        fc.calculate(data=df, return_df=True)
Maybe, in this case, it's not a big problem, but in my opinion it's cleaner to define everything about how to perform the calculation in the `FeatureDescriptor`, and then call a single method to do the actual calculation. This would simplify the mapping from the config file to the Python object.
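Until the API settles, the branching could be kept in one place on the user side. The sketch below assumes the non-final `calculate_unsegmented` method discussed in this thread exists alongside `calculate`; `is_whole_series` and `calculate_from_config` are hypothetical helper names, and treating both `None` and `-1` as "whole series" is only the convention proposed above, not tsflex behaviour.

```python
from typing import Any, Mapping, Optional, Sequence


def is_whole_series(window: Optional[int]) -> bool:
    """True when the config asks for a calculation over the entire series."""
    return window is None or window == -1


def calculate_from_config(fc: Any, entries: Sequence[Mapping[str, Any]], data: Any) -> Any:
    """Call the matching method on `fc`, based on the `window` values in the config."""
    flags = {is_whole_series(entry.get("window")) for entry in entries}
    if not flags:
        raise ValueError("the config contains no feature descriptors")
    if len(flags) > 1:
        raise ValueError("cannot mix whole-series and windowed descriptors in one call")
    if flags == {True}:
        # Assumption: the non-final method name from this thread.
        return fc.calculate_unsegmented(data=data, return_df=True)
    return fc.calculate(data=data, return_df=True)
```

The rest of the application then calls `calculate_from_config(fc, config["FeatureDescriptor"], df)` and never needs to know which of the two methods was used.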
I hope this could help.
Thanks again!
Marco.
Related issue #63
Hello, First of all, I would like to thank you for the really nice library. I think it is much more straightforward and at the same time flexible, compared to similar libraries. I have a use case where sometimes I need to compute features in a rolling fashion, for which the `window` parameter of the `FeatureDescriptor` object is really helpful, and some other times I need to compute features on time series batches. That is, the window parameter equals the length of the entire time series. However, I'm having a few issues with the latter case. Here is an example:
If I run the code above, I get the following error (personal info is hidden):
If I specify `window=len(df) - 1`, it works, but then, of course, it is not using the last data point in the calculation.
Am I doing something wrong? Is there a way to achieve the required behaviour?
Thanks a lot!
Environment: python==3.8.13 numpy==1.22.4 pandas==1.4.2 tsflex==0.2.3.7.7