openclimatefix / Open-Source-Quartz-Solar-Forecast

Open Source Solar Site Level Forecast
MIT License
43 stars 40 forks source link

Evaluate live models #132

Open peterdudfield opened 3 weeks ago

peterdudfield commented 3 weeks ago

Detailed Description

It would be great to evaluate the live models

Context

Possible Implementation

Jacqueline-J commented 3 weeks ago

I'd be glad to work on this. Just for clarification. I am to follow the method here but with the updates 5min parquet?

and by 'run forecast over 2024-05 and see which ones does the best' do you mean run the forecast for each model (gradient boosting ICON gradients boosting GRs, XGboost) and compare the performance?

peterdudfield commented 3 weeks ago

hi @Jacqueline-J

Thanks for reaching out on this issue. In general follow that methodology but, Im not sure ICON will be availabel for those dates. We had a break in the ICON data. There's a chance we can get the forecasted NWPs from Open meteo, and that might be easier.

Yea, try to run this for each of the models and compare the results

peterdudfield commented 3 weeks ago

You might need to use https://open-meteo.com/en/docs/previous-runs-api in order to get NWP forecast that were made in the past

Jacqueline-J commented 2 weeks ago

Hi Peter,

Sorry for the delay. I've been working on this using a combination of my code and the repository's code, but I've encountered some issues. I'll explain my process:

• I created a test file with the first 10 PV IDs from the 5-minute parquet and then generated random timestamps for each of those systems. • I used that DataFrame to call the NWP data. However, I ran into an issue here: they only provided the cloud coverage variable, not the cloud coverage for low, medium, and high altitudes. • I then used the generated NWP data to produce predictions using:

predictions_df = run_forecast(site, model="xgb", ts=timestamp, nwp_source=all_nwp_data) 

• I merged the predictions with the 5-minute data to get the actual power generated and ensured that the columns were correct for passing to the metrics function. I have a question here: is the horizon hour supposed to be just the hour? So, for the timestamp 2024-05-03 10:00:00, is the horizon hour 10? • The MAE seems to be quite high, which leads me to my third question: are the API and 5-minute data on the same scale? I think they are both in Wh, but API responses tend to be on a much smaller scale (from 0.000000 to 2.264649), whereas the 5-minute data tends to be much larger (from 0.000000 to 298.419000). • I also modified the function to calculate the MAE for each PV system, but the results seem to have high error rates.

Here are the metrics: • Metrics for PV system 2607: MAE 51.7943 kW, normalized 1761.71% • Metrics for PV system 2626: MAE 75.6673 kW, normalized 2101.87% • Metrics for PV system 2638: MAE 137.1433 kW, normalized 3463.21% • Metrics for PV system 2657: MAE 117.8041 kW, normalized 2945.1% • Metrics for PV system 2660: MAE 28.4913 kW, normalized 712.28% • Metrics for PV system 2754: MAE 54.8527 kW, normalized 1371.32% • Metrics for PV system 2760: MAE nan kW, normalized nan% • Metrics for PV system 2766: MAE 32.9762 kW, normalized 1465.61% • Metrics for PV system 2770: MAE 81.2336 kW, normalized 2030.84% • Metrics for PV system 2775: MAE 86.2678 kW, normalized 2679.12%

However, the cloud coverage data was not accurate, which could be contributing to these high error rates.

Could you provide some guidance on these issues?

peterdudfield commented 2 weeks ago

Thanks @Jacqueline-J for this work.

What do you mean the API here?

The forecast horizon is the hours or minutes after the forceast is made. So horizon 10 (I think) is 10 minutes after the foreast is made. So if the forecast is made at 2024-06-10 09:00:00 then the hoirzon 10 value would be for 2024-06-10 09:10:00. Does this make sense?

Im sure the cloud coverage does effect it, but there might be ways to get round that? Which function do you use the get the NWP data?

It might be worth plotting a few forecasts and the truths, to understand a bit more whats going on

Jacqueline-J commented 1 week ago

sorry, I made an error in my last message. Not API, but the function to generate predictions. I plotted pv2607 for reference pv2607.

I think what could explain the differences is the units of measurements. Both are labeled as 'wh' but maybe the forecasted power is in Kwh? So if I convert predicted value from Kwh to wh the results begin to look better.

and If I resample the 5 minute data into hour intervals, then the results would look like this (assuming that the 5min data is the average power generated over a 5minute interval).

pv2607 resampled which is better, but I'm not sure this is the method I'm meant to take.

Jacqueline-J commented 1 week ago

Thank you for clarifying the horizon hour, I've amended my code.

In terms of the NWP data, I was using the python script generated from here https://open-meteo.com/en/docs/previous-runs-api and just entering mock variables for cloud coverage.

Jacqueline-J commented 1 week ago

This is how I'm generating the NWP data


import openmeteo_requests
import requests_cache
import pandas as pd
from retry_requests import retry

# pv_data from the test df
pv_df = test_df
pv_df['timestamp'] = pd.to_datetime(pv_df['timestamp']).dt.tz_localize(None)

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after=3600)
retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
openmeteo = openmeteo_requests.Client(session=retry_session)

# Prepare an empty DataFrame to collect all results
all_nwp_data = pd.DataFrame()

# Loop through each row in the pv_df
for idx, pv_row in pv_df.iterrows():
    url = "https://previous-runs-api.open-meteo.com/v1/forecast"
    params = {
        "latitude": pv_row['latitude'],
        "longitude": pv_row['longitude'],
        "hourly": ["temperature_2m",
                   "precipitation",
                   "cloud_cover",
                   "shortwave_radiation"],
        "past_days": 0
    }

    responses = openmeteo.weather_api(url, params=params)

    # Process response
    response = responses[0]

    hourly = response.Hourly()
    hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
    hourly_precipitation = hourly.Variables(1).ValuesAsNumpy()
    hourly_cloud_cover = hourly.Variables(2).ValuesAsNumpy()
    hourly_shortwave_radiation = hourly.Variables(3).ValuesAsNumpy()

    # Define start and end time for the range
    start_time = pv_row['timestamp']
    end_time = pv_row['timestamp']

    # Create the date range for hourly data
    hourly_data = {"timestamp": pd.date_range(
        start=start_time,
        end=end_time,
        freq=pd.Timedelta(hours=1),
        inclusive="left"
    )}

    # Ensure the lengths of the data arrays match the length of the timestamp range
    required_length = len(hourly_data["timestamp"])

    hourly_data["temperature_2m"] = hourly_temperature_2m[:required_length]
    hourly_data["precipitation"] = hourly_precipitation[:required_length]
    hourly_data["cloud_cover"] = hourly_cloud_cover[:required_length]
    hourly_data["shortwave_radiation"] = hourly_shortwave_radiation[:required_length]

    # Calculate additional cloud cover variables
    hourly_data['cloudcover_low'] = hourly_data['cloud_cover'] / 5
    hourly_data['cloudcover_mid'] = hourly_data['cloud_cover'] / 2
    hourly_data['cloudcover_high'] = hourly_data['cloud_cover'] / 1

    # Convert to DataFrame
    nwp_df = pd.DataFrame(data=hourly_data)
    nwp_df['pv_id'] = pv_row['pv_id']
    nwp_df['latitude'] = pv_row['latitude']
    nwp_df['longitude'] = pv_row['longitude']
    nwp_df = nwp_df.drop('cloud_cover', axis=1)

    # Append to the all_nwp_data DataFrame
    all_nwp_data = pd.concat([all_nwp_data, nwp_df], ignore_index=True)

# Display the combined DataFrame
all_nwp_data.head(2)
peterdudfield commented 1 week ago

sorry, I made an error in my last message. Not API, but the function to generate predictions. I plotted pv2607 for reference pv2607.

I think what could explain the differences is the units of measurements. Both are labeled as 'wh' but maybe the forecasted power is in Kwh? So if I convert predicted value from Kwh to wh the results begin to look better.

and If I resample the 5 minute data into hour intervals, then the results would look like this (assuming that the 5min data is the average power generated over a 5minute interval).

pv2607 resampled which is better, but I'm not sure this is the method I'm meant to take.

greta stuff, impressive.

Yea the units from kw to kwh is always a bit confusing. Perhaps its worth being really clear in the variable names what they are, otherwise we will get confused again.

peterdudfield commented 1 week ago

This is how I'm generating the NWP data

import openmeteo_requests
import requests_cache
import pandas as pd
from retry_requests import retry

# pv_data from the test df
pv_df = test_df
pv_df['timestamp'] = pd.to_datetime(pv_df['timestamp']).dt.tz_localize(None)

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after=3600)
retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
openmeteo = openmeteo_requests.Client(session=retry_session)

# Prepare an empty DataFrame to collect all results
all_nwp_data = pd.DataFrame()

# Loop through each row in the pv_df
for idx, pv_row in pv_df.iterrows():
    url = "https://previous-runs-api.open-meteo.com/v1/forecast"
    params = {
        "latitude": pv_row['latitude'],
        "longitude": pv_row['longitude'],
        "hourly": ["temperature_2m",
                   "precipitation",
                   "cloud_cover",
                   "shortwave_radiation"],
        "past_days": 0
    }

    responses = openmeteo.weather_api(url, params=params)

    # Process response
    response = responses[0]

    hourly = response.Hourly()
    hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
    hourly_precipitation = hourly.Variables(1).ValuesAsNumpy()
    hourly_cloud_cover = hourly.Variables(2).ValuesAsNumpy()
    hourly_shortwave_radiation = hourly.Variables(3).ValuesAsNumpy()

    # Define start and end time for the range
    start_time = pv_row['timestamp']
    end_time = pv_row['timestamp']

    # Create the date range for hourly data
    hourly_data = {"timestamp": pd.date_range(
        start=start_time,
        end=end_time,
        freq=pd.Timedelta(hours=1),
        inclusive="left"
    )}

    # Ensure the lengths of the data arrays match the length of the timestamp range
    required_length = len(hourly_data["timestamp"])

    hourly_data["temperature_2m"] = hourly_temperature_2m[:required_length]
    hourly_data["precipitation"] = hourly_precipitation[:required_length]
    hourly_data["cloud_cover"] = hourly_cloud_cover[:required_length]
    hourly_data["shortwave_radiation"] = hourly_shortwave_radiation[:required_length]

    # Calculate additional cloud cover variables
    hourly_data['cloudcover_low'] = hourly_data['cloud_cover'] / 5
    hourly_data['cloudcover_mid'] = hourly_data['cloud_cover'] / 2
    hourly_data['cloudcover_high'] = hourly_data['cloud_cover'] / 1

    # Convert to DataFrame
    nwp_df = pd.DataFrame(data=hourly_data)
    nwp_df['pv_id'] = pv_row['pv_id']
    nwp_df['latitude'] = pv_row['latitude']
    nwp_df['longitude'] = pv_row['longitude']
    nwp_df = nwp_df.drop('cloud_cover', axis=1)

    # Append to the all_nwp_data DataFrame
    all_nwp_data = pd.concat([all_nwp_data, nwp_df], ignore_index=True)

# Display the combined DataFrame
all_nwp_data.head(2)

And its probably worth noting, this is getting the final NWP for that timestamp? Or is it getting the NWP at the time of that timestamp? These NWP are forecast which make them slightly confusing