time-series-foundation-models / lag-llama

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
Apache License 2.0

Unsatisfactory Fine-tuning Results: input-independent and large deviation from real values #85

Open simona-0 opened 4 months ago

simona-0 commented 4 months ago

Hi, thank you for your contribution. I have been trying to fine-tune your model on a univariate time series forecasting task with the C-MAPSS turbine datasets; the goal is to learn the trajectory pattern and thus predict the future trend. I started with 140 trajectories as the training set (each corresponding to sensor_14 from the C-MAPSS dataset, running from the beginning of the experiment until the failure at the end) and 60 trajectories as the test set (also from sensor_14). Because the test results were poor, I am now evaluating on the training set and deliberately trying to overfit, to see whether the model can memorise any trajectory at all. However, the model does not learn much from the training data and still has a relatively high training loss (dropping only from ~8 to ~3).

I ran a grid search over lr, batch_size and context_length, yet the best training loss the model could reach (after a few thousand epochs) is still around 3. Looking at the predictions shown below, the model appears to be a bit 'lazy': it essentially predicts the average of the trajectories it saw during training, to the point of being almost input-independent. The whole trajectory was visible to the model, so it is not as if the end was hidden and the deterioration pattern could not be learned. Also, the first prediction at time t+1 tends to deviate strongly from the true value.

Could this be related to the training process or to the model itself? I would really appreciate your opinion on this.

image
simona-0 commented 4 months ago

In case this is relevant to anyone: after changing nonnegative_pred_samples from True to False, manually standardising my training datasets, and manually restoring the mean and std on the test-set predictions, the results look much better. I also tried using scaling='std' or scaling='robust' while leaving my datasets without standardisation, but the same problem occurred.

image
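For reference, the two things I changed map to estimator settings roughly like this (a rough sketch based on the repo's fine-tuning demo; argument names and defaults may differ between versions, and the checkpoint path, horizon and epochs are just placeholders):

```python
import torch
from lag_llama.gluon.estimator import LagLlamaEstimator

# load the pretrained checkpoint to reuse its architecture hyperparameters
ckpt = torch.load("lag-llama.ckpt", map_location="cpu")
model_kwargs = ckpt["hyper_parameters"]["model_kwargs"]

estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",
    prediction_length=24,                 # example horizon
    context_length=64,                    # example context window
    input_size=model_kwargs["input_size"],
    n_layer=model_kwargs["n_layer"],
    n_embd_per_head=model_kwargs["n_embd_per_head"],
    n_head=model_kwargs["n_head"],
    time_feat=model_kwargs["time_feat"],
    nonnegative_pred_samples=False,       # was True; clipping samples to >= 0 hurt here
    scaling=None,                         # assuming None selects the no-op scaler in this version;
                                          # I standardise the data manually instead
    batch_size=64,
    trainer_kwargs={"max_epochs": 50},
)
```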
ashok-arjun commented 4 months ago

Thanks @simona-0 for the issue and following up with a fix.

I'm not sure why this is happening, but the scaling does have something to do with it.

By "manually standardising", do you mean globally computing a mean and standard deviation from the training set, and standardizing the training and test sets? And in this case, do you turn off standardization in the model?

Also, on a related note: Are you using FP32 format for the model/data, or other quantized versions (like FP16/BF16)?

simona-0 commented 4 months ago

Hi @ashok-arjun, thank you for your speedy reply. By manually standardising I mean subtracting from both the train and test datasets the mean of the train dataset (computed over all trajectories and all time steps), and then dividing the difference by the average of the per-trajectory standard deviations from the train dataset (the statistics are computed from the train set only, of course; the test set is never used for them). When I did that, no standardisation happened inside the model, because I turned it off. The results are quite satisfactory, as shown in my last comment.
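In code, that recipe looks roughly like this (a minimal NumPy sketch; trajs_train, trajs_test and forecast_samples are hypothetical placeholders for the per-trajectory arrays and the model's output samples):

```python
import numpy as np

# trajs_train / trajs_test: hypothetical lists of 1-D arrays, one per trajectory (sensor_14)
train_mean = np.concatenate(trajs_train).mean()            # mean over all train trajectories and time steps
train_std = np.mean([traj.std() for traj in trajs_train])  # average of per-trajectory stds (train only)

trajs_train_scaled = [(traj - train_mean) / train_std for traj in trajs_train]
trajs_test_scaled = [(traj - train_mean) / train_std for traj in trajs_test]

# ... fine-tune and predict on the scaled data with the model's internal scaling turned off ...

# restore the forecasts to the original units
forecast_restored = forecast_samples * train_std + train_mean
```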

I also tried simply passing scaling='std' without any manual standardisation, and the outcomes look rather poor. As for the datatype, all the trajectories are in float64.

image

By the way, I am working on a thesis on fine-tuning and eventually extending lag-llama to multivariate TS forecasting, and I find this scaling issue really interesting. My assumption is that, for lag-llama, global scaling in the data preprocessing suits datasets with large magnitude and comparatively small standard deviation better than the window scaling built into the model. I would definitely appreciate your thoughts on this.
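For what it's worth, here is a toy illustration (entirely made-up numbers, not C-MAPSS values) of how different per-window statistics can be from global ones on a slowly drifting, low-noise sensor channel, which is the regime where I suspect the built-in window scaling struggles:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy trajectory: large offset, slow downward drift, small noise
traj = 643.0 - 0.02 * np.arange(300) + rng.normal(0.0, 0.05, size=300)

window = traj[:32]  # one 32-step context window
print(f"global mean={traj.mean():.2f}, global std={traj.std():.3f}")
print(f"window mean={window.mean():.2f}, window std={window.std():.3f}")
# the window std comes out several times smaller than the global std, so
# per-window scaling and global scaling put the data on very different scales
```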

bannatyne84 commented 3 months ago

I see a similar thing. As mentioned here, I ran it with nonnegative_pred_samples set to both True and False, with no observable effect. Here scaling is also set to 'std'. image

I don't think I can analyse this in the same depth as you esteemed academics, but it does seem to be related to just the pure magnitude of the data; this graph looks closer, if not perfect, since the input is an entirely predictable repeating series of the integers 1-100. image

And this is with scaling set to 'robust'. I don't see a huge amount of difference, if any. All CRPS scores have been essentially 0.99999 across context lengths. I'm obviously missing something fundamental about the model. image

simona-0 commented 3 months ago

@bannatyne84 Hi, I think we have the same issue here. A quick fix would be to manually standardise your datasets before feeding them to the model; doing so brings the magnitude and spread of the data into a range lag-llama handles well. Of course, you then need to restore the model output using the mean and standard deviation from the initial standardisation step. The nonnegative_pred_samples flag doesn't matter in your case because your datasets don't have any negative values. As for why this fix works while setting scaling='std' or scaling='robust' doesn't, it might have something to do with the model itself.