sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License
3.88k stars 614 forks

Manage imbalancing in TFT #1040

Open LuigiDarkSimeone opened 2 years ago

LuigiDarkSimeone commented 2 years ago

I have a dataset of several shops. For each shop I have a time series of sales. The shops are spread unequally across the world (1000 in the US, 100 in the EU), and I need to predict sales based on location and other variables. However, such a dataset is imbalanced. Is there a way to manage imbalance in TFT (upsampling, downsampling, applying a weight-balance scheme similar to sklearn, or forcing each batch to select an equal number of examples)?

fnavruzov commented 2 years ago

Have you tried the "weight" argument while creating datasets? You can create a column with weights to be used in training:

ds = TimeSeriesDataSet(
    data=data[train_data_filter],
    time_idx=time_idx_col,
    target=...,
    weight='weight', # pass name of a weight column in your df, samples/sampler weight(s)
    group_ids=group_ids,
    ...
)
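One way to populate such a weight column is to make each sample's weight inversely proportional to the size of its group, so that under-represented regions (e.g. 100 EU shops vs 1000 US shops) contribute equally to the loss. This is only a sketch; the "region" grouping and the weighting scheme are illustrative, not something the library prescribes.

```python
# Sketch: inverse-frequency weights. Each sample is weighted by
# 1 / (number of samples in its group), so every group's weights
# sum to the same total regardless of group size.
def inverse_frequency_weights(groups):
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    return [1.0 / counts[g] for g in groups]

# Toy example: 10 US samples and 2 EU samples.
groups = ["US"] * 10 + ["EU"] * 2
weights = inverse_frequency_weights(groups)
# Each group's weights now sum to 1.0, so the loss contribution is balanced.
```

The resulting list can be assigned to a dataframe column (e.g. `data["weight"]`) and passed by name via the `weight` argument.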
RonanFR commented 2 years ago

Hi @LuigiDarkSimeone,

1) As suggested by @fnavruzov, one way to "rebalance" the dataset is to use the weight argument of TimeSeriesDataSet. This generates a weight tensor in addition to the target tensor used while fitting the model. Note that in this case the portion of the loss associated with each sample is weighted differently. This is similar to what is done in scikit-learn (the sample_weight argument of the .fit(...) method).

2) You could also use the weights to alter the probability that a given sample is part of a mini-batch (sampling scheme). As indicated in the documentation, you can call the to_dataloader method with a custom sampler, for example an instance of torch's WeightedRandomSampler. You can find a small example here.

3) You can also combine both 1) and 2).
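The idea behind option 2) can be illustrated without the library: indices are drawn with probability proportional to per-sample weights, so minority groups appear in mini-batches as often as majority ones. This pure-Python stand-in only demonstrates the sampling behaviour of something like torch's WeightedRandomSampler; the weights below are illustrative.

```python
import random

# Draw mini-batch indices with probability proportional to per-sample
# weights (replacement sampling, as WeightedRandomSampler does by default).
def weighted_sample_indices(weights, num_samples, seed=0):
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)

# 10 "majority" samples (indices 0-9) and 2 "minority" samples (10-11),
# weighted inversely to group size so each group has total weight 1.0.
weights = [0.1] * 10 + [0.5] * 2
idx = weighted_sample_indices(weights, 1000)
minority_share = sum(i >= 10 for i in idx) / len(idx)
# minority_share is close to 0.5: both groups are drawn about equally often.
```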

N.B.: The DeepAR paper empirically shows the benefit of method 2) compared to not using any weights. To the best of my knowledge, they do not present any result based on method 1). That being said, in their setting the issue is the size of the dataset, and the main problem is selecting the most relevant samples: since the total number of samples is huge, it may not be possible to go over all of them several times during training, and they show that weighting the samples based on their "velocity" greatly improves performance.

See also: Weighted loss functions vs weighted sampling?

LuigiDarkSimeone commented 2 years ago

First of all, thanks to @RonanFR and @fnavruzov for your replies. Lately it has been quite hard to get answers here. I will have a look at your options and test them to see whether they are suitable for my case.

Given how hard it has been for me to get answers, and since you seem to be experts, I would like you to kindly have a look at this question I posted quite a few days ago (which I suspect will never get an answer otherwise):

https://github.com/jdb78/pytorch-forecasting/issues/1032

I know it is not good practice to post another question in a different issue, so I really apologise in advance, but I cannot get past this problem, even after looking at the source code. Hope to hear from you soon.

Many thanks, Luigi

FrancescoFondaco commented 2 years ago

Thanks @RonanFR, @fnavruzov.

I am trying to implement what you've suggested using the "weight" argument of the TimeSeriesDataSet class in order to manage imbalances in my dataset.

training = TimeSeriesDataSet(
    myData,
    time_idx="Time_idx",
    target="TVPI",
    group_ids=["Fund"],
    min_encoder_length=8,
    max_encoder_length=80,
    min_prediction_length=1,
    max_prediction_length=30,
    weight="Weight",
    static_categoricals=...,
)

Where the Weight column contains the weight associated with each sample.

Unfortunately, the described implementation raises an error.

Would you know how to solve it? Thanks, Francesco

RonanFR commented 2 years ago

Hi @FrancescoFondaco ,

Can you provide a detailed minimal reproducible example that raises this error ? (small toy dataset of only few lines)

QijiaShao commented 2 years ago

> I am trying to implement what you've suggested using the "weight" argument of the TimeSeriesDataSet class in order to manage imbalances in my dataset. [...] Unfortunately, the described implementation raises an error.

Have you figured out this issue? I am having the same issue after adding the "weight" parameter. Thx!

terbed commented 1 year ago

Dear @FrancescoFondaco and @QijiaShao, I suspect the issue is related to the automatic forward-fill NaN mechanism. If your time index is not continuous, the missing steps are filled in automatically, but the weights are missing for those filled samples. So you should disable automatic filling if you are using weights. This is just a guess.
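One quick way to test this guess is to check each group's time index for gaps before passing weights (gaps are what trigger the automatic filling, controlled in the library by options such as allow_missing_timesteps). A minimal sketch, assuming an integer time index:

```python
# Sketch: report the integer time steps missing between the first and last
# observed step. If this returns a non-empty list for any group, the dataset
# would auto-fill those steps, which have no corresponding weight value.
def find_time_gaps(time_idx):
    present = set(time_idx)
    return [t for t in range(min(time_idx), max(time_idx) + 1)
            if t not in present]

find_time_gaps([0, 1, 2, 5, 6])  # steps 3 and 4 are missing
```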

Best wishes, Daniel