openclimatefix / ocf_datapipes

OCF's DataPipe based dataloader for training and inference
MIT License
13 stars 11 forks source link

Fix PV system IDs #220

Closed dfulu closed 10 months ago

dfulu commented 1 year ago

Describe the bug

When using the Passiv PV for training and loading it usingocf_datapipes.load.pv.pv.load_everything_into_ram(), the function takes the ss_id ID and renames it as pv_system_id in the xarray.Dataset returned. In the input metadata, there is also a column named system_id, which is not carried through to the xarray.Dataset. Both of the ss_id and system_id initially come from the sheffield solar API and they are different.

Currently, only the system_id ID is available in our database, although it is under the pv_system_id column there. Also, when we load from the database we encode this ID like n -> int(f"{n}1").

All this makes it quite difficult to match between the same systems in training and production. Being able to match will be important for adding Passiv inputs to PVNet as per openclimatefix/PVNet#68

Some potential fixes

  1. Stick to the ss_id
    • This would mean making ss_id available in the database
    • This could avoid us needing to encode the ID when loading from the database
    • I believe the ss_id is unique for each system across Enphase, Passiv, and PVOutput. This means that we would need to change the library if we wanted to use PV systems outside of these sources.
    • However, we already have the ss_id (sheffield solar ID?) already hard coded into load_everything_into_ram(), so we would likely need to change that anyway.
  2. Stick to system_id.
    • The database wouldn't need to be changed, but I think some work would still need to be done to get_pv_power_from_database() and maybe think about the encoding.
    • We would also need to pull system_id through load_everything_into_ram(), and that may involve some renaming of the IDs variables.
  3. Create our own ID strategy, perhaps like the encode_label() and apply this everywhere consistently.
  4. Commit to using multiple IDs.
    • Add system_id to the dataset created in load_everything_into_ram() and select systems based on this.
    • Don't change get_pv_power_from_database() but just select the IDs like system_id -> int(f"{system_id}1")
    • This could be the easiest option to implement, but feels like it adds technical debt
dfulu commented 1 year ago

@jacobbieker and @peterdudfield, thoughts?

jacobbieker commented 1 year ago

I would be inclined to our own encode_label() option, as we use more PV systems, it could keep the code a bit easier than multiple ID systems. Keeping them consistent might be a bit of work, but probably good long term too

peterdudfield commented 1 year ago

Thanks for putting this all together, its really nice having it all laid out. Ive put some thought into here - https://github.com/openclimatefix/nowcasting_dataset/issues/691, but never finished it.

Few comments

  1. The problem with 1, is we will soon have a few systems that are not coming through Sheffield Solar, so that might muck this up.

  2. changing the database is annoying.

  3. is what I've linked too, and perhaps that is the good way to do it. Enocoding like this would work, i'd been tempted to do n -> int(f"{n}{xx}") where xx i the number of provider. We might quickly go over 10 so 2 digits gives us a bit more room

  4. if you have time not add tech debt, then we should

dfulu commented 1 year ago

I agree that load_everything_into_ram() should be modified.

pv_system_row_number is computed based on the input dataset and I don't know if it would be a reliable ID to use to identify the system between datasets. Like between training and production. When loading training data, that number comes from the order of the PV systems in the training dataset, like the PV/Passive/ocf_formatted/v0/passive.netcdf file on GCP rather than being from the metadata csv. The order of the the PV systems in the passive.netcdf (and therefore the pv_system_row_number) seems to be quite arbitrary.

Also, unless I'm mistaken, for loading the PV data from the database we set the pv_system_row_number to 1 for all systems.

Something like 3 might be the best bet

dfulu commented 10 months ago

I'm going to close this now. This was fixed by #225 and #226 and by adding ml_id to the training data in order to match the database data