sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

question: how to load data from timeseries of multiple devices #149

Closed: geoHeil closed this issue 3 years ago

geoHeil commented 4 years ago

I have a dataset which consists of many multivariate time series (i.e. time series with more than one value per timestamp), originating from many IoT devices.

How can I load such a dataset into PyTorch using your data loader (https://pytorch-forecasting.readthedocs.io/en/latest/data.html), or do I need to implement my own? I need to ensure the data is interpreted correctly, so that the LSTM can learn patterns within an individual time series / window while a batch still combines data from multiple devices / time windows.

I would want to use it for an LSTM-autoencoder to perform anomaly detection.



import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'hour': {0: Timestamp('2020-01-01 00:00:00'), 1: Timestamp('2020-01-01 00:00:00'), 2: Timestamp('2020-01-01 00:00:00'), 3: Timestamp('2020-01-01 00:00:00'), 4: Timestamp('2020-01-01 00:00:00'), 5: Timestamp('2020-01-01 01:00:00'), 6: Timestamp('2020-01-01 01:00:00'), 7: Timestamp('2020-01-01 01:00:00'), 8: Timestamp('2020-01-01 01:00:00'), 9: Timestamp('2020-01-01 01:00:00')},
    'metrik_0': {0: 2.020883621337143, 1: 2.808770093182167, 2: 2.5267618429653402, 3: 3.2709845883575346, 4: 3.7984105853602235, 5: 4.0385160093937795, 6: 4.643267594258785, 7: 1.3012379179114388, 8: 3.509304898336378, 9: 2.8664748765561208},
    'metrik_1': {0: 4.580434685779621, 1: 2.933188328317023, 2: 3.999229120882797, 3: 2.9099857745449706, 4: 4.6302055552849, 5: 4.012670194672169, 6: 3.697352153313931, 7: 4.855210603371005, 8: 2.2197913449032254, 9: 2.393605868973481},
    'metrik_2': {0: 3.680527279150989, 1: 2.511065648719921, 2: 3.8350007982479113, 3: 2.4063786290320333, 4: 3.231433617897482, 5: 3.8505378854180115, 6: 5.359150077287063, 7: 2.8966469424805386, 8: 4.554080028058399, 9: 3.3319064764061914},
    'cohort_id': {0: 1, 1: 2, 2: 1, 3: 2, 4: 2, 5: 1, 6: 2, 7: 2, 8: 1, 9: 2},
    'device_id': {0: 1, 1: 3, 2: 4, 3: 2, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 5},
})
jdb78 commented 4 years ago

The short answer is yes: you can use the built-in TimeSeriesDataSet.

from pytorch_forecasting import TimeSeriesDataSet

dataset = TimeSeriesDataSet(
    # use total_seconds() so the hourly index stays correct across day boundaries
    # (.dt.seconds wraps around after 24 hours)
    df.assign(is_anomaly=0, time_idx=lambda x: ((x.hour - x.hour.min()).dt.total_seconds() // 3600).astype(int)),
    max_encoder_length=0,  # do not encode
    max_prediction_length=1,  # guess you want to set this to a higher number and also specify min_prediction_length
    group_ids=["cohort_id", "device_id"],
    time_idx="time_idx",
    target="is_anomaly",
    time_varying_known_reals=["metrik_0", "metrik_1", "metrik_2"],
)

# test the dataloader
x, y = next(iter(dataset.to_dataloader()))
x.keys()
x["decoder_cont"]

Now you can use x["decoder_cont"] to train your autoencoder and later, for validation, use is_anomaly to check whether the anomaly detection works. One potential downside you might want to be aware of is that metrik_0, metrik_1 and metrik_2 are z-score normalized across all values. You can normalize each time series by using a GroupNormalizer but, currently, there is no way to normalize each subsequence on its own. However, you could of course do that in a PyTorch Module yourself.
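As a minimal sketch of that training loop (the LSTMAutoencoder module and all hyperparameters are illustrative assumptions, not part of pytorch-forecasting):

import torch
from torch import nn

class LSTMAutoencoder(nn.Module):
    # hypothetical module for illustration, not part of pytorch-forecasting
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, x):
        # compress the window into the encoder's final hidden state ...
        _, (hidden, _) = self.encoder(x)
        # ... then repeat it for every timestep and let the decoder unroll it
        repeated = hidden[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(repeated)
        return self.head(decoded)

loader = dataset.to_dataloader(train=True, batch_size=64)
x, _ = next(iter(loader))
model = LSTMAutoencoder(n_features=x["decoder_cont"].size(-1))
optimizer = torch.optim.Adam(model.parameters())

for x, _ in loader:
    window = x["decoder_cont"]  # (batch, decoder_length, n_features)
    loss = nn.functional.mse_loss(model(window), window)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

For the per-series normalization, my understanding is that the scalers argument of TimeSeriesDataSet accepts a normalizer per feature; a sketch, assuming your version supports GroupNormalizer there and that grouping by device_id is what you want:

from pytorch_forecasting.data import GroupNormalizer

scalers = {
    name: GroupNormalizer(groups=["device_id"])  # z-score within each device
    for name in ["metrik_0", "metrik_1", "metrik_2"]
}
# then pass scalers=scalers to the TimeSeriesDataSet constructor above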

geoHeil commented 4 years ago

Many thanks! Does max_prediction_length correspond to the window length of data fed to the autoencoder?

For generating labels you used df.assign(is_anomaly=0, ...); I could create a dataset from true (but noisy) labels instead.

Why do you choose max_encoder_length=0 ("do not encode"), i.e. why set it to 0?

Thanks for pointing this out:

> One potential downside you might want to be aware of is that metrik_0, metrik_1 and metrik_2 are z-score normalized across all values. You can normalize each time series by using a GroupNormalizer but, currently, there is no way to normalize each subsequence on its own. However, you could of course do that in a PyTorch Module yourself.

Could you link to a line in the code from which I should start digging, i.e. where to change the code or integrate a custom module?

jdb78 commented 4 years ago

I have not leveraged PyTorch Forecasting for autoencoders, but you should definitely be able to do so.

Resources to look at are the current model implementations and the BaseModel.
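For the subsequence normalization question, one integration point is a small preprocessing module applied at the start of your network's forward pass. A minimal sketch (WindowNormalizer is a hypothetical helper, not part of the library):

from torch import nn

class WindowNormalizer(nn.Module):
    # hypothetical helper, not part of pytorch-forecasting:
    # z-score each subsequence (window) independently, per feature
    def forward(self, x):
        # x: (batch, time, features)
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True, unbiased=False).clamp_min(1e-8)  # guard against constant windows
        return (x - mean) / std

You would call it on x["decoder_cont"] as the first step of forward, before the encoder LSTM.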

If you want to use PyTorch Forecasting all the way, I believe taking these steps should do it:

geoHeil commented 3 years ago

Many thanks. This is really interesting. However, why is manual feature engineering required, i.e. why do I manually need to create the sliding windows? I know that in other disciplines, such as NLP, whole documents can be fed in and the network automatically derives distance-based features using attention (Transformer, BERT). Are you aware of something similar for time series?

jdb78 commented 3 years ago

When training, you need to work with sliding windows (if only for memory reasons). Of course, your test set can work with an (almost) infinite encoder length or prediction length, so that the whole time series is processed in one go.
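As a sketch of that idea (the horizon value is a placeholder; adjust it to the length of your series), a long-window validation set could be derived with from_dataset:

validation = TimeSeriesDataSet.from_dataset(
    dataset,
    df,
    predict=True,              # one sample per series, ending at its last time step
    stop_randomization=True,   # deterministic windows for evaluation
    max_prediction_length=24,  # placeholder: long enough to cover the series
)
val_loader = validation.to_dataloader(train=False, batch_size=64)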

However, I can think of reasons why this might be more problematic in time series forecasting than in NLP: