sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

question: how to load data from timeseries of multiple devices #149

Closed: geoHeil closed this issue 3 years ago

geoHeil commented 4 years ago

I have a dataset which consists of many multivariate time series (i.e. time series with more than one value per timestamp), originating from many IoT devices.

How can I load such a dataset into PyTorch using your data loader (https://pytorch-forecasting.readthedocs.io/en/latest/data.html), or do I need to implement my own? I need to ensure the data is interpreted correctly, so that the LSTM can learn patterns within an individual time series / window while a batch still combines data from multiple devices / time windows.

I would want to use it for an LSTM-autoencoder to perform anomaly detection.



import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'hour': {0: Timestamp('2020-01-01 00:00:00'), 1: Timestamp('2020-01-01 00:00:00'), 2: Timestamp('2020-01-01 00:00:00'), 3: Timestamp('2020-01-01 00:00:00'), 4: Timestamp('2020-01-01 00:00:00'), 5: Timestamp('2020-01-01 01:00:00'), 6: Timestamp('2020-01-01 01:00:00'), 7: Timestamp('2020-01-01 01:00:00'), 8: Timestamp('2020-01-01 01:00:00'), 9: Timestamp('2020-01-01 01:00:00')},
    'metrik_0': {0: 2.020883621337143, 1: 2.808770093182167, 2: 2.5267618429653402, 3: 3.2709845883575346, 4: 3.7984105853602235, 5: 4.0385160093937795, 6: 4.643267594258785, 7: 1.3012379179114388, 8: 3.509304898336378, 9: 2.8664748765561208},
    'metrik_1': {0: 4.580434685779621, 1: 2.933188328317023, 2: 3.999229120882797, 3: 2.9099857745449706, 4: 4.6302055552849, 5: 4.012670194672169, 6: 3.697352153313931, 7: 4.855210603371005, 8: 2.2197913449032254, 9: 2.393605868973481},
    'metrik_2': {0: 3.680527279150989, 1: 2.511065648719921, 2: 3.8350007982479113, 3: 2.4063786290320333, 4: 3.231433617897482, 5: 3.8505378854180115, 6: 5.359150077287063, 7: 2.8966469424805386, 8: 4.554080028058399, 9: 3.3319064764061914},
    'cohort_id': {0: 1, 1: 2, 2: 1, 3: 2, 4: 2, 5: 1, 6: 2, 7: 2, 8: 1, 9: 2},
    'device_id': {0: 1, 1: 3, 2: 4, 3: 2, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 5},
})
jdb78 commented 4 years ago

The short answer is yes: you can use the built-in TimeSeriesDataSet.

from pytorch_forecasting import TimeSeriesDataSet

dataset = TimeSeriesDataSet(
    # use total_seconds() so the hourly index stays correct across day boundaries
    # (.dt.seconds wraps around after 24 hours)
    df.assign(is_anomaly=0, time_idx=lambda x: ((x.hour - x.hour.min()).dt.total_seconds() // 3600).astype(int)),
    max_encoder_length=0,  # do not encode
    max_prediction_length=1,  # guess you want to set this to a higher number and also specify min_prediction_length
    group_ids=["cohort_id", "device_id"],
    time_idx="time_idx",
    target="is_anomaly",
    time_varying_known_reals=["metrik_0", "metrik_1", "metrik_2"],
)

# test the dataloader
x, y = next(iter(dataset.to_dataloader()))
x.keys()
x["decoder_cont"]

Now you can use x["decoder_cont"] to train your autoencoder and later, for validation, use is_anomaly to check whether the anomaly detection works. One potential downside you might want to be aware of is that metrik_0, metrik_1 and metrik_2 are z-score normalized across all values. You can normalize each time series by using a GroupNormalizer but, currently, there is no way to normalize each subsequence on its own. However, you could of course do that in a PyTorch Module yourself.
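As a minimal sketch of that training loop (the LSTMAutoencoder module and all hyperparameters are illustrative assumptions, not part of pytorch-forecasting):

import torch
from torch import nn

class LSTMAutoencoder(nn.Module):
    # hypothetical module for illustration, not part of pytorch-forecasting
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, x):
        # compress the window into the encoder's final hidden state ...
        _, (hidden, _) = self.encoder(x)
        # ... then repeat it for every timestep and let the decoder unroll it
        repeated = hidden[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(repeated)
        return self.head(decoded)

loader = dataset.to_dataloader(train=True, batch_size=64)
x, _ = next(iter(loader))
model = LSTMAutoencoder(n_features=x["decoder_cont"].size(-1))
optimizer = torch.optim.Adam(model.parameters())

for x, _ in loader:
    window = x["decoder_cont"]  # (batch, decoder_length, n_features)
    loss = nn.functional.mse_loss(model(window), window)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

For the per-series normalization, my understanding is that the scalers argument of TimeSeriesDataSet accepts a normalizer per feature; a sketch, assuming your version supports GroupNormalizer there and that grouping by device_id is what you want:

from pytorch_forecasting.data import GroupNormalizer

scalers = {
    name: GroupNormalizer(groups=["device_id"])  # z-score within each device
    for name in ["metrik_0", "metrik_1", "metrik_2"]
}
# then pass scalers=scalers to the TimeSeriesDataSet constructor above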

geoHeil commented 4 years ago

Many thanks! Does max_prediction_length correspond to the window length of data fed to the autoencoder?

For generating labels you used df.assign(is_anomaly=0, ...); I could create a dataset from true (but noisy) labels instead.

Why do you choose max_encoder_length=0 ("do not encode"), i.e. why set it to 0?

Thanks for pointing this out:

> One potential downside you might want to be aware of is that metrik_0, metrik_1 and metrik_2 are z-score normalized across all values. You can normalize each time series by using a GroupNormalizer but, currently, there is no way to normalize each subsequence on its own. However, you could of course do that in a PyTorch Module yourself.

Could you link to a line in the code from which I should start digging, i.e. where to change the code or integrate a custom module?

jdb78 commented 4 years ago

I have not leveraged PyTorch Forecasting for autoencoders, but you should definitely be able to do so.

Resources to look at are the current model implementations and the BaseModel.
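For the subsequence normalization question, one integration point is a small preprocessing module applied at the start of your network's forward pass. A minimal sketch (WindowNormalizer is a hypothetical helper, not part of the library):

from torch import nn

class WindowNormalizer(nn.Module):
    # hypothetical helper, not part of pytorch-forecasting:
    # z-score each subsequence (window) independently, per feature
    def forward(self, x):
        # x: (batch, time, features)
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True, unbiased=False).clamp_min(1e-8)  # guard against constant windows
        return (x - mean) / std

You would call it on x["decoder_cont"] as the first step of forward, before the encoder LSTM.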

If you want to use PyTorch Forecasting all the way, I believe taking these steps should do it:

geoHeil commented 3 years ago

Many thanks. This is really interesting. However, why is manual feature engineering required, i.e. why do I manually need to create the sliding windows? I know that in other disciplines, such as NLP, whole documents can be fed in and the network automatically derives distance-based features using attention (Transformer, BERT). Are you aware of something similar for time series?

jdb78 commented 3 years ago

When training, you need to work with sliding windows (if only for memory reasons). Of course, your test set can work with an (almost) infinite encoder length or prediction length, so that the whole time series is processed in one go.
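As a sketch of that idea (the horizon value is a placeholder; adjust it to the length of your series), a long-window validation set could be derived with from_dataset:

validation = TimeSeriesDataSet.from_dataset(
    dataset,
    df,
    predict=True,              # one sample per series, ending at its last time step
    stop_randomization=True,   # deterministic windows for evaluation
    max_prediction_length=24,  # placeholder: long enough to cover the series
)
val_loader = validation.to_dataloader(train=False, batch_size=64)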

However, I can think of reasons why this might be more problematic in time series forecasting than in NLP: