jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Prediction results using dataloader and dataframe do not match #1163

Open Ying-Kang opened 1 year ago

Ying-Kang commented 1 year ago

Expected behavior

I finished training and got a model, but when I predict the same time series in two different ways from the same data source, the prediction results differ from each other, while they should be identical.

Actual behavior

# using dataloader
actuals: tensor([[345084.,  50992.,      0.,      0.,      0.,      0.,      0.,      0.,
              0.]])
predictions: tensor([[431652.2812,  51685.0312,      0.0000,      0.0000,      0.0000,
              0.0000,      0.0000,      0.0000,      0.0000]])
# using dataframe
actuals: [345084.0, 50992.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predictions: [431909.71875, 52412.390625, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Code to reproduce the problem

import pandas as pd
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet

data = pd.read_csv(args.data_path)

# build the validation dataset from the training dataset definition
valid_dset = TimeSeriesDataSet.from_dataset(
    train_dset, data,
    # data_fore=raw_data_fore,
    predict=True, stop_randomization=True,
)
val_dataloader = valid_dset.to_dataloader(train=False, batch_size=args.batch_size, num_workers=2)

best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

# path 1: predict from the dataloader
predictions_from_dataloader = best_tft.predict(val_dataloader)
# path 2: predict directly from the raw dataframe
predictions_from_dataframe = best_tft.predict(data).tolist()[0]
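
The actuals/predictions printed above can be produced roughly like this (a sketch assuming the variables from the snippet above; the actuals follow the library's usual pattern of concatenating the targets yielded by the dataloader):

import torch

# sketch: actual targets from the validation dataloader
# (each batch yields (x, y) where y is a (target, weight) tuple)
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])

print("actuals:", actuals)
print("predictions (dataloader):", predictions_from_dataloader)
print("predictions (dataframe):", predictions_from_dataframe)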
Ying-Kang commented 1 year ago

The dataloader is derived from the TimeSeriesDataSet, and the training procedure dumps the dataset settings into the checkpoint.

I found that when the input is a dataframe, the .from_parameters() method is used instead. .from_parameters() reuses the parameters from the training settings, except for predict and stop_randomization.

As mentioned above, the checkpoint was saved with these settings, so why are the results different? Is there anything I should pay attention to?
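
If the two paths really use the same dataset settings, routing the dataframe through the dataset construction explicitly should reproduce the dataloader result. A minimal sketch, assuming the variables from the snippet in the issue description and that predict() returns plain tensors as in the output above (get_parameters()/from_parameters() are the TimeSeriesDataSet helpers for exporting and restoring dataset settings):

import torch
from pytorch_forecasting import TimeSeriesDataSet

# sketch: rebuild the prediction dataset explicitly from the saved dataset
# parameters, so the dataframe goes through exactly the same preprocessing
# as the dataloader path
params = train_dset.get_parameters()
explicit_dset = TimeSeriesDataSet.from_parameters(
    params, data, predict=True, stop_randomization=True
)
explicit_loader = explicit_dset.to_dataloader(train=False, batch_size=args.batch_size, num_workers=2)

pred_explicit = best_tft.predict(explicit_loader)
pred_dataframe = best_tft.predict(data)
# if these disagree, the difference comes from how predict() converts the
# dataframe into a dataset, not from the model itself
print(torch.allclose(pred_explicit, pred_dataframe))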

Ying-Kang commented 1 year ago

I exported the two candidate datasets in timeseries.py just before they are turned into tensors; the differences occur only in the time_idx column. I figured out that I initialize time_idx with:

data["time_idx"] = list(range(len(data)))

so during training it keeps increasing and time_idx can reach 10000+, while in the raw dataframe passed in for prediction, time_idx starts from 0 again. So, any suggestion about the initial value for time_idx? Thanks anyway.
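
One common way to avoid this mismatch is to derive time_idx from the timestamp column rather than from the row position, so that any slice of the data gets the same values. A minimal sketch, assuming a hypothetical daily "date" column and an arbitrary fixed reference date:

import pandas as pd

# sketch: derive time_idx from the timestamp, anchored to a fixed reference
# date, so the same row always gets the same time_idx no matter which slice
# of the data is loaded ("date" and ORIGIN are hypothetical, adjust to the
# actual data)
ORIGIN = pd.Timestamp("2020-01-01")

data["date"] = pd.to_datetime(data["date"])
data["time_idx"] = (data["date"] - ORIGIN).dt.days.astype(int)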