vividfog / nordpool-predict-fi

A Python app and a Random Forest ML model that predicts spot prices for the Nordpool FI market.
MIT License

Model accuracy issues #11

Closed Maaxion closed 4 months ago

Maaxion commented 4 months ago

There are a few major issues with the model, the largest being the way the train/test data is split. Since this is temporal / time series data, adjacent data points are highly correlated. If you do a random train/test split, you will leak training data into the test set, and as such the model is likely to show artificially high performance on the test set. This phenomenon is known as data leakage, where information from outside the training dataset is used to create the model. In the context of time series data, where measurements are dependent on time, the sequences of data points are often highly correlated with their immediate predecessors and successors. If a random train/test split is employed without considering the temporal nature of the data, some of the information from the training set can inadvertently be included in the test set.

To address this issue and avoid data leakage, you should use a time-based split for the train/test division. This approach ensures that the model is trained on data from a certain period and tested on data from a subsequent period, mirroring how the model would be used in real-world forecasting or time series analysis scenarios. By doing so, the test set acts as a more accurate representation of future, unseen data, and the model's performance metrics are more reliable.

Further, running the trained model on past data that is included in the training set will likely lead to overly optimistic performance metrics. This is because the model has already "seen" this data during training, and as such, it can easily predict these outcomes, which doesn't accurately reflect its ability to predict new, unseen data.

To calculate historic performance you can employ a technique known as walk-forward validation or rolling forecast origin. This method involves incrementally moving the cut-off point between the training and test sets forward in time, training the model on a fixed or expanding window of past data and testing it on the following period. Each step forward allows the model to be tested in a manner that simulates real-world forecasting situations, where only past data is available to predict future outcomes.

By continually retraining the model on the most recent data and forecasting the next period, you can assess how well the model adapts to changes in the data over time. This method provides a more robust and realistic evaluation of the model's predictive performance and its potential effectiveness in practical applications.

This technique can help identify model overfitting early on. Overfitting occurs when a model learns the noise in the training data to the extent that it negatively impacts the performance on new data. Because walk-forward validation tests the model on multiple, consecutive future points, it offers a clearer picture of how the model generalizes to new data. If the model performs well on the training set but poorly on the test sets during the walk-forward validation, it's a strong indication that the model may be overfitting.
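For illustration, a minimal sketch of walk-forward validation with an expanding window; the column names (Price_cpkWh as the target) and window sizes are placeholders, not your actual pipeline:

```python
# Walk-forward validation sketch: train on everything before the cut-off,
# test on the next `horizon` hours, then move the cut-off forward.
# Assumes `df` is sorted by time and Price_cpkWh is the target column.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def walk_forward_mae(df, feature_cols, target_col="Price_cpkWh",
                     initial_train=24 * 90, horizon=24 * 7):
    df = df.sort_index()  # temporal order is essential here
    scores = []
    for cutoff in range(initial_train, len(df) - horizon, horizon):
        train = df.iloc[:cutoff]                  # expanding window of past data
        test = df.iloc[cutoff:cutoff + horizon]   # the following, unseen period
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(train[feature_cols], train[target_col])
        preds = model.predict(test[feature_cols])
        scores.append(mean_absolute_error(test[target_col], preds))
    return np.mean(scores)
```

Each fold only ever predicts hours that come after its training data, which is what makes the resulting error estimate comparable to real forecasting use.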

vividfog commented 4 months ago

That sounded a little like a GPT generation 😀 ... it was well written!

Anyways, I did mention two key points in the README about this.

First, this is not a time series forecast. We don't record the dates. Only month, weekday and hour. It doesn't know which data follows what in sequential back to back detail. More about the hypothesis in the README.

Second, the model might be overfit to historical data. Yet its primary function is to predict short-term outcomes, where it has been useful in practice, determining the near-term direction and range quite accurately. Quote from README.

I agree with what you wrote, for almost all other cases, and it is a happy accident if this model works. It's built for predicting whether it's a good day to charge a car, or whether I should wait until Wednesday. As long as it achieves that human goal, the R2 doesn't even matter. The scores are calculated out of curiosity and to compare models with one another, not as a benchmark in any general sense.

I would not use this hypothesis for any other problem I know! But for this one it Works for Me ™ and somehow doing things "wrong" appears to result in a useful model. As far as a hobby project goes.

I haven't come up with a way to do things "right" that results in a more useful model. Many variations were tested before settling with this. More in the README.

Thanks for the heads up. I do appreciate it.

Maaxion commented 4 months ago

> That sounded a little like a GPT generation 😀 ... it was well written!

I will take that as a compliment, I guess.

I think you must have misunderstood my comment. Your model is most likely exceedingly overfit. The metrics you've posted do not reflect the model's performance, as they are calculated on the data that was used to train the model.

> Anyways, I did mention two key points in the README about this.

> First, this is not a time series forecast. We don't record the dates. Only month, weekday and hour. It doesn't know which data follows what in sequential back to back detail. More about the hypothesis in the README.

The model does know which data follows what, as you are including hour, day of week, and month features in the model.

You can read more from the book FPP3, more-or-less the gold standard for time series forecasting: https://otexts.com/fpp3/accuracy.html

Since you're using SKlearn, you could use e.g. TimeSeriesSplit to properly split your data.
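Something like this, for illustration; X and y must be sorted by time, and the names here are placeholders rather than your actual util/train.py code:

```python
# TimeSeriesSplit keeps temporal order: every fold trains on the past
# and tests on the period that follows it.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(n_estimators=100, random_state=42)

# X and y sorted chronologically, so each split respects the timeline
scores = cross_val_score(model, X, y, cv=tscv, scoring="r2")
print("R² per fold:", scores, "mean:", scores.mean())
```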

Here's also a decent blog post on why RF is not that great for time series forecasting (as you note, it is not presented as a method in FPP3).

vividfog commented 4 months ago

Thank you for the clarification and the references. I hear where you're coming from.

I did try various cross-validation techniques while working on this, and I just tried TimeSeriesSplit too, hoping it could indeed capture temporal relations, if there are temporal progressions to be found. Model performance was very bad, R2 around 0.11. Perhaps that could be tweaked higher via different splits, more data, or different hyperparameters, but I doubt it would result in a strong model. And I don't want to use data before 2023; 2022 was a very unusual year.

Reverting back, I also tried removing the month element from the training data, and the model performance didn't change much. Now there's only "weekday" and "hour" left of a time nature in the training data set. No month, no date, no year → no change (almost). If this is a temporal prediction, that should be odd? Perhaps suggesting that the core drivers of price fluctuations are captured by other variables instead?

# With month element, baseline, 80/20 split
# X_filtered = df_filtered[['day_of_week', 'hour', 'month', 'NuclearPowerMW'] + fmisid_ws + fmisid_t]

MAE (vs test set): 1.805931342463788
MSE (vs test set): 8.734900902480973
R² (vs test set): 0.7906288561283733

# Month element removed, only minor degradation, same split
# X_filtered = df_filtered[['day_of_week', 'hour', 'NuclearPowerMW'] + fmisid_ws + fmisid_t]

MAE (vs test set): 1.9163698394683955
MSE (vs test set): 9.552172445871683
R² (vs test set): 0.7710392717926379
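For reference, a rough sketch of the kind of pipeline that produces numbers like these; the helper and hyperparameters here are placeholders, not the actual util/train.py code:

```python
# Random 80/20 split + Random Forest + the three metrics quoted above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(df_filtered, feature_cols, target_col="Price_cpkWh"):
    X, y = df_filtered[feature_cols], df_filtered[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)  # random split, not time-based
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print("MAE (vs test set):", mean_absolute_error(y_test, preds))
    print("MSE (vs test set):", mean_squared_error(y_test, preds))
    print("R² (vs test set):", r2_score(y_test, preds))
    return model
```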

In my own experience, watching these charts for many months with sweaty palms (they have real world € consequences), and reading the human predictions out there, I've come to believe that energy prices are subject to abrupt changes due to a myriad of factors beyond simple temporal progressions.

January's 200+ c/kWh spike is an extreme example of that. Yesterday, we went from zero to a few cents in one discrete jump, and we can consider that normal. The training data has some apparent continuous temporal trends, but those are almost exactly the ones to ignore, not to learn from.

My hypothesis posits that price dynamics are influenced by specific hours (indicating activity levels), weekdays (reflecting business operations), and related weather conditions (such as heating demand and wind energy production), and that the model should make an individual prediction based on those. For individual hours. Because it doesn't matter what the price is now. It could be 10x or 100x the next hour if the wind stops and it's still cold.

Should temporal patterns become apparent when the results are visualized as sequential predictions, that is, from my perspective, more a consequence than a causative factor in the model's framework. That's why I treat this as a niche problem with a niche solution and don't use conventional wisdom for time series forecasting. It's not entirely time-free either. Somewhere in between.

Your expertise in ML is evident from how you write, so I take seriously what you may suggest next. My past trials, and this new trial, add to the evidence (for me) that the problem shouldn't be treated as a time series to forecast. We do have "weekday" and "hour #" in the training data, but the training should (and I believe does) treat them more like labels associated with weather events and nuclear power MW values, rather than "time" as such.

I recognize the unconventional nature of basing accuracy evaluations on past training data, yet this approach is intentional, tailored to the unique demands of what I want to predict. If for the learned conditions, say for a Wednesday at 9 PM with high winds and very low temps and low nuclear power, the price is balanced to somewhere in the 10 cent range: Let's say this happens again. I do want the prediction to be 10 cents again. And if it's not that cold but still cold, then perhaps it should say 8 cents. Those are the kind of "patterns" I want the model to learn. Not what kind of price follows what time. It doesn't matter if those hours are 7 months apart from each other.

This is to say: I don't think of this as seasonality. I think of it as "recurrence". When the same events happen again, the same price (hopefully) happens again. And if somewhat similar events happen, somewhat similar prices emerge. No matter what the price was the previous hour, or will be the next hour. It may have been zero, or 100x compared to "now".

Perhaps for clarity though, I should not "leak" this philosophy into how the evals are displayed. I realize it looks odd for those who've done ML for a while. Then again ... it's by design, to cater for this very particular puzzle this repo adds a few cents to decipher.

Side note: I'm not entirely convinced the headline of this issue is accurate. For the few days the model has been running, it has drawn a line very close to the real-world prices, and the predictions have been very helpful to me, also in line with the human predictors in some of the discussion groups. I'd like to read it as "eval accuracy issues" rather than "model accuracy issues", because I don't think the model has been proven to be accurate OR inaccurate quite yet. It's premature to say either way. Next week's results might shed light on that.

This was a long reply. Hope it was helpful. Thanks for the critique and insights you may have. I encourage you to view this through a lens that perhaps diverges from traditional time-series analysis and look forward to any alternative suggestions you might offer.

Also remember this is a weekend hobby (currently a vacation hobby), and while I'd like the model to be "good" for what it's made for, I don't mind if it's not "great" as a flawless genie. It's open source, so someone with more time on their hands can always re-write the key bits, or re-think the whole thing. Perhaps this serves at least as a ladder step or a stop-gap until then.

What I intended to work on tonight was better visualization of the past predictions. Because that indeed would make any trends apparent in how accurate or inaccurate the model is in future conditions that pass, and people have been asking for a chart for that. Perhaps I'll get to that tomorrow. There's an issue about that next to this one. Feel free to add ideas there too.

vividfog commented 4 months ago

Because why not, I also tried this:

# X_filtered = df_filtered[['day_of_week', 'NuclearPowerMW'] + fmisid_ws + fmisid_t]
# y_filtered = df_filtered['Price_cpkWh']

MAE (vs test set): 2.158027543175604
MSE (vs test set): 11.411013323353718
R² (vs test set): 0.7264837988526746

That model would still seem somewhat useful for the intended purpose, albeit predictably lower performing, and now the training data is devoid of anything that resembles a sequence, temporal or otherwise. Weekday is expressed as an integer, yet another value, just like wind speed. Given that it takes only 7 values, it's almost like a label, a categorical feature.

This experiment underscores my belief that we can find meaningful patterns and predictions in the data by focusing on the specific conditions of each moment, rather than by trying to apply time-series forecasting practices to it.

Maaxion commented 4 months ago

> I did try various cross-validation techniques while working on this, and I just tried TimeSeriesSplit too, hoping it could indeed capture temporal relations, if there are temporal progressions to be found. Model performance was very bad, R2 around 0.11.

This happens because with a proper train/test split that isn't leaking data, the performance metrics are more accurate and reflect the actual performance of the model. When you have data leakage between your train and test data set, metrics like R2 shoot up high.

> Reverting back, I also tried removing the month element from the training data, and the model performance didn't change much. Now there's only "weekday" and "hour" left of a time nature in the training data set. No month, no date, no year → no change (almost). If this is a temporal prediction, that should be odd?

It is completely expected, and there are several explanations. As you only have one year of data, your month feature essentially has only one data point per month value. In order to effectively add month, quarter, or year features you would need multiple years' worth of data.

You mention that you don't want to include 2022 in your dataset, but I do think you should. The seasonality effect will still be there and will be correct. This is perhaps the single largest effect on the power price, as power consumption at the peak of winter is around 2x the low of summer. When this data is missing from your dataset, the model will not be able to accurately incorporate the seasonality effect as we progress through summer.

> For individual hours. Because it doesn't matter what the price is now. It could be 10x or 100x the next hour if the wind stops and it's still cold.

Electricity price forecasting is something that is pretty well researched; if you jump down the rabbit hole you'll find a lot of studies on the matter. The power price is not random, or random-like, but has strong correlations, many of which are linear (note: I am definitely not an expert on power price forecasting). But as the market reaches "unstable" points, e.g. what happened on 5.1., the normal factors that influence the price affect it less. On 5.1. purchasers essentially bought more power than was available, and the market entered a new state where the normal driving factors were no longer the primary ones setting the price. Because almost all supply was bought up, the main factor affecting the market price for that auction was essentially just the cost of the last few hundred MW of production. When the market is in a normal state, i.e. when there is excess production available and consumption is within normal bounds, models will tend to work well. But at the extremes, most models will fail, mainly because there is very little data on these types of scenarios, so you can't really model them.

The things that add complexity to modeling the power price in e.g. Finland are Nordpool and the EUPHEMIA market coupling algorithm. Since Finland is not in a vacuum, but connected to the Nordic and European power markets via many interconnects, you have many more factors that affect the power price. IMO a model that would get 0.9 R2 would more or less have to model each Nordpool electricity region separately, and somehow include the rest of Europe as well. Even if aiming for lower accuracy, for Finland alone you'd want multiple points of weather data, as it is a very different scenario if it is -30 °C in Lapland or in Helsinki. You'd also want some kind of regressor on the weather data, as it is a well-known effect that the more consecutive cold days there are, the higher the power consumption goes.

You can use some of the decomposition methods mentioned in FPP3 to help analyze which features of your data affect the price the most.
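For example, with statsmodels' STL; the price column name and hourly frequency here are placeholders for your data:

```python
# STL decomposition of the hourly price series: trend + daily seasonal + residual.
# Running STL again on the trend with period=168 would expose the weekly cycle.
import pandas as pd
from statsmodels.tsa.seasonal import STL

series = df["Price_cpkWh"].asfreq("h").interpolate()  # df has a DatetimeIndex
result = STL(series, period=24, robust=True).fit()
result.plot()

# Comparing component variances hints at how much of the price the daily
# pattern explains versus the slower trend and the residual.
for name, comp in [("trend", result.trend), ("seasonal", result.seasonal),
                   ("resid", result.resid)]:
    print(name, round(comp.var(), 3))
```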

We do have "weekday" and "hour #" in the training data, but the training should (and I believe does) treat them more like labels associated with weather events and nuclear power MW values, rather than "time" as such.

Time in and of itself is just a proxy for the correlations in the data. Even if you remove the time features from the dataset, the underlying correlation still exists in the data, albeit with a weaker signal.

> I recognize the unconventional nature of basing accuracy evaluations on past training data, yet this approach is intentional, tailored to the unique demands of what I want to predict.

I'm afraid it isn't just unconventional, it is simply incorrect.

Maaxion commented 4 months ago

> January's 200+ c/kWh spike is an extreme example of that. Yesterday, we went from zero to a few cents in one discrete jump, and we can consider that normal. The training data has some apparent continuous temporal trends, but those are almost exactly the ones to ignore, not to learn from.

Domain knowledge is, perhaps, the second most important factor involved in making an accurate model.

I would read up on how Nordpool works, and how the EUPHEMIA algorithm works. It is entirely normal for the power price to jump from zero to a few cents: when wind is high and consumption is low, there will be power suppliers who do not want to shut down and would rather sell power at a loss. This happens for many reasons: the power price may be expected to stay low for a shorter duration than they can ramp down and up, or they are contractually obligated to be running, or they still incur so many fixed costs even when off that it doesn't matter, or they are simply crap at selling power.

If you look at the aggregate order curves published by Nordpool, you'll see that the offers have multiple "steps" in them; this is one reason why the power price can react nonlinearly to continuous predictors. Once consumption hits the level of one of those steps, the power price rises by a step.

To account for that more accurately, it would be good to include in the model the power production available at various cost levels (e.g. coal at around 200 €/MWh, etc.).

Any power price prediction model for Finland is essentially trying to model the EUPHEMIA algorithm, as that is what fundamentally creates the price of electricity.

vividfog commented 4 months ago

Note that I don't intend to be able to accurately predict the next 300 cent black swan or outdo the state of the art; that's out of scope for this hobby and a high bar for any project. I do hope to know what's going to be the cheapest day this week to charge my car under "regular" market conditions. If there is a state of the art, it's not publicly available. I haven't read up on EUPHEMIA. Thanks for the pointer. I have however been studying Nordpool as a consumer/observer for quite a while.

You imply that there may still be hidden autocorrelation inside the data. I went on and tried to surface that, because that wouldn't be good for the RF, as you pointed out.

Hypothesis: If there are strong or dominating temporal patterns (even if not derived from a time stamp, but from a sequence, autocorrelation, hidden patterns) in this training data, and the model has learned to depend on them, then wouldn't it be true to assume that:

1. Autocorrelations would show up in autocorrelation tests (and if not, why not)? But autocorrelation is negligible with the tests I'm aware of:
→ Durbin-Watson autocorrelation test: 2.01
→ ACF values for the first lags:
  Lag 1: 1.0000
  Lag 2: -0.0083
  Lag 3: -0.0066
  Lag 4: -0.0079
  Lag 5: 0.0133
  Lag 6: -0.0285
2. Time would show up as a major factor in feature importances, like in the example in the article you shared. But in this data set it's the weather that does most of the explaining:
→ Feature Importance:
       Feature  Importance
      t_101339    0.165141
     ws_101256    0.125596
      t_101786    0.112542
      t_100968    0.098442
      t_101118    0.092589
NuclearPowerMW    0.087611
     ws_101673    0.077983
     ws_101846    0.077760
          hour    0.049461
     ws_101267    0.044495
   day_of_week    0.042169
         month    0.026208
3. Model performance would degrade or crash if devoid of direct time-bound variables? But removing all but weekday (I bet I could remove that too) still results in a model with R2 in the same ballpark, useful for the designed purpose of the model. I shared this in my previous comment.

4. Model performance would crash if the order of the data rows (already lacking time stamps) is shuffled before training? I'm not so sure about this assumption, but nevertheless, the scores don't crash. R2 for sorted vs shuffled is 0.79 vs 0.74.

5. Permutation test would indicate that random chance plays a big role? But the opposite seems to be true:

→ Permutation Test Results (will take LONG while):
  Permutations Baseline MSE: 17.3964
  Permutation Scores Mean MSE: 90.6635
  p-value: 0.0099

This code is now part of the latest commits, in util/train.py.
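For readers following along, here's a rough sketch of the kinds of checks listed above (not the actual util/train.py); it assumes a fitted RandomForestRegressor `model`, a time-ordered feature frame `X` and target `y`:

```python
# Diagnostics: residual autocorrelation, feature importances, permutation test.
import pandas as pd
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import acf
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import permutation_test_score

residuals = y - model.predict(X)

# Durbin-Watson near 2.0 and near-zero ACF values suggest little leftover
# temporal structure in the residuals.
print("Durbin-Watson:", durbin_watson(residuals))
print("ACF (first lags):", acf(residuals, nlags=5))

# Do time-like features dominate the importances, or weather?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Permutation test: how does the real score compare to scores on shuffled targets?
score, perm_scores, p_value = permutation_test_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X, y, scoring="neg_mean_squared_error", n_permutations=100, n_jobs=-1)
print("Baseline MSE:", -score)
print("Permutation mean MSE:", -perm_scores.mean())
print("p-value:", p_value)
```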

After there's a more robust set of evals, I'll replace the README numbers with those, or remove them. I've grown to agree that reporting "unconventional" numbers (even if for my personal purposes they were good for tracking deltas) might not be optimal for inviting external feedback.

Maaxion commented 4 months ago

I think you need to take a step back and first get the basics right before trying to do too many complicated things.

Measuring R2 only on your training data just shows that your model is able to explain the training data set, but it says nothing about how the model would perform on unseen data.

You're currently seeing these same good metrics on your train/test split because the random split leaks data from the train set into the test set.

When you corrected the train/test split to not leak data into your test set, you saw the performance metrics plummet. That wasn't your model's performance changing; the metrics then reflected the model's performance on new, unseen data rather than on data it had already seen.

That crash in metrics is a very tell-tale sign that your model is overfit. So is getting 0.9 R2 with the features you have, as the features in the model definitely do not account for 90% of the variance of the power price, and the fact that your model was 100% off for Sunday's power price shows this.

vividfog commented 4 months ago

The model can be 100% off for individual hours and that's OK. We're not pricing NASDAQ options here, we're trying to estimate if today is a good day to charge our cars or heat our boilers, because tomorrow and the few days that follow will be more expensive, or less expensive. That's the bar! Yes, it's that humble...

It doesn't matter if it's 2 vs 4 cents, or 15 cents versus 20 cents, since the range variation overall is 2x, 5x, 10x on a regular basis. If I can get the number of digits right, that's a big win compared to what I had 2 weeks ago, when I had nothing! :)

→ If you have a higher standard, that's OK. I would too, if this was a job. But it's a February hobby. Maybe March too, we'll see.

Sunday wasn't that bad I think. Here is a plot of pre-training and post-training actuals and predictions. For coming from such a wrong place, it has been more than enough for charging a car today and not tomorrow. Also, this is the Olkiluoto 3 shutdown weekend, so it's not the easiest time period to crack at the first try. At least the means are staying between min and max lines here.

[plot-2024-03-03: pre- and post-training actuals and predictions, with min/mean/max lines]

The day the model was last trained is shown in bold (2024-02-29), and the column I'm concerned about is the error in cents (Error_c):

| timestamp | Predict_mean | Actual_mean | Error_c | Predict_min | Actual_min | Predict_max | Actual_max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-02-22 00:00:00+00:00 | 3.3 | 2.9 | 0.4 | 0.9 | 0.2 | 4.8 | 4.3 |
| 2024-02-23 00:00:00+00:00 | 0.9 | 0.4 | 0.5 | 0.1 | -0.1 | 3.1 | 1.8 |
| 2024-02-24 00:00:00+00:00 | 3.3 | 3.2 | 0 | 0.3 | 0 | 5.1 | 5.3 |
| 2024-02-25 00:00:00+00:00 | 3.7 | 3.9 | -0.2 | 2.9 | 3.3 | 4.1 | 4.4 |
| 2024-02-26 00:00:00+00:00 | 5.9 | 6.1 | -0.2 | 3.4 | 3.3 | 7.6 | 8.1 |
| 2024-02-27 00:00:00+00:00 | 6.8 | 7.5 | -0.7 | 3.1 | 2.1 | 9.5 | 13.8 |
| 2024-02-28 00:00:00+00:00 | 3 | 2.6 | 0.3 | 2.5 | 2.1 | 3.6 | 3.4 |
| **2024-02-29 00:00:00+00:00** | 3 | 1.5 | 1.5 | 1.3 | -0 | 4.8 | 2.6 |
| 2024-03-01 00:00:00+00:00 | 2.8 | 2.9 | -0.1 | 1.2 | 0 | 3.7 | 3.9 |
| 2024-03-02 00:00:00+00:00 | 7.8 | 5.7 | 2.1 | 2.8 | 4.1 | 10.9 | 7.9 |
| 2024-03-03 00:00:00+00:00 | 10.2 | 8.8 | 1.4 | 7.9 | 6.4 | 11.9 | 12.5 |
| 2024-03-04 00:00:00+00:00 | 14.2 | 11.6 | 2.7 | 11.3 | 8.2 | 17.9 | 15.2 |
| 2024-03-05 00:00:00+00:00 | 15.8 | nan | nan | 11 | nan | 19.7 | nan |
| 2024-03-06 00:00:00+00:00 | 14.2 | nan | nan | 10.6 | nan | 17.7 | nan |
| 2024-03-07 00:00:00+00:00 | 13.9 | nan | nan | 9.8 | nan | 17.2 | nan |
| 2024-03-08 00:00:00+00:00 | 13.4 | nan | nan | 10.1 | nan | 16 | nan |

If you're right and this will fail in a glorious way next week, I'm the first one to cheer for that. Because then I have the motivation to "step back" and re-think, have something fun to do next weekend. Until then, I'm going to enjoy the rest of the last day of the vacation and perhaps finish the past-prediction visuals, #10. The sketch above is in data/create/60_plot_results, a quick throwaway script until there's a GUI for that.

I'll do some cleanup for the eval statements made in the README, because I don't think they're too good at communicating the intent and motivation right now. The current copy will likely not satisfy anyone who's researched more than a few ML projects, even though I've written plenty of disclaimers to avoid making any inflated claims, especially on the live demo page.

Such as: "Tulevaisuudessa nämä samat olosuhteet saattavat johtaa erilaiseen hintaan, jolloin malli on ennusteineen väärässä." aka "In the future, these same conditions may lead to a different price, at which point the model with its predictions will be wrong." That's not fine print, it's paragraph font right next to the graph. Repeated elsewhere in many ways: "pörssisähkön ensi viikon hintaan ei tällä vatkaimella ole todellista näkyvyyttä, saati valtaa". Not sure how to translate that.

Based on what I've learned, weather plays a big role here, and weather forecasts themselves are not accurate. This means that even if the upper layers of the ensemble were perfect, this challenge isn't solved until weather forecasting is solved too.

I think I'll close this issue soon and think of it as an umbrella enhancement for later, when there's more actual data on how the model works and how it fails. It can then spawn a number of more specific issues/tasks. There's work to do at the upper levels of the stack before that.