openclimatefix / nowcasting_utils

Common functionality between SatFlow and predict_pv_yield
https://nowcasting-utils.readthedocs.io/en/stable/
MIT License

Tool to plot all data in an on-disk batch #46

Closed JackKelly closed 2 years ago

JackKelly commented 2 years ago

Detailed Description

As @peterdudfield shows in openclimatefix/nowcasting_dataset#178, sometimes the best way to identify bugs in data is to plot the data :)

It would be awesome to have a tool which "simply" plots / animates all the data in each example, so we can visually check that the data looks sensible (e.g. check that PV output decreases as clouds pass overhead, etc.)

Possible Implementation

Maybe this should live in nowcasting_utils?

(BTW, if someone outside of Open Climate Fix would like to tackle this issue then please shout and we can share a few batches of data!)

peterdudfield commented 2 years ago

There's some batch data in the tests/data folder, but it might need updating to use all the new variables

JackKelly commented 2 years ago

I've moved this issue to "critical" because it'd be really awesome to visualise the contents of each on-disk batch before we start training models (e.g. to check by eye that the different data sources are aligned in space and time, and aren't off by an hour, things like that!)

This could just be a Jupyter Notebook for visualising one batch at a time. Maybe animate the satellite image sequences and NWPs
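A minimal sketch of what animating one image sequence could look like, using matplotlib's `FuncAnimation`. The function name `animate_image_sequence`, the `(time, y, x)` array layout and the synthetic frames are all assumptions for illustration, not the actual batch format:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation


def animate_image_sequence(frames, timestamps, title="satellite"):
    """Animate a (time, y, x) array, showing each frame's timestamp in the title.

    Hypothetical helper: the array layout is an assumption, not the Batch API.
    """
    fig, ax = plt.subplots()
    # Fix the colour scale across all frames so brightness changes are real.
    image = ax.imshow(frames[0], vmin=frames.min(), vmax=frames.max())

    def update(i):
        image.set_data(frames[i])
        ax.set_title(f"{title} {timestamps[i]}")
        return (image,)

    update(0)
    return FuncAnimation(fig, update, frames=len(frames), interval=200)


# Synthetic stand-in for one example: 6 frames of 16x16 pixels, 5 minutes apart.
rng = np.random.default_rng(0)
frames = rng.random((6, 16, 16))
timestamps = np.datetime64("2021-06-01T12:00") + np.arange(6) * np.timedelta64(5, "m")
anim = animate_image_sequence(frames, timestamps)
```

In a notebook, something like `IPython.display.HTML(anim.to_jshtml())` should display the animation inline (and the `matplotlib.use("Agg")` line can be dropped). The same helper could be reused for NWP fields, assuming they share the same layout.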

peterdudfield commented 2 years ago

I'm trying to think whether it's better to plot using Batch or BatchML, or maybe both. Batch is nice because it's the raw data from nowcasting_dataset, but BatchML is what the ML model will see, so that will also be good to plot.

Perhaps a plotting function could take in 'PV' or 'PVML', and produce the plot (or subplot) from there.

JackKelly commented 2 years ago

ooh, yeah, good questions... I'm not sure what's best, either... I guess it'll be important for the plots to include the absolute positions in time and space (i.e. the OSGB coordinates for the spatial data; and the absolute datetimes for the timeseries data) so we can check by eye that the different modalities align in space and time... but the locations are in both Batch and BatchML, right?
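As a rough sketch of that idea: draw the image on OSGB axes via `imshow(extent=...)`, draw the time series on absolute datetimes, and mark the PV location and the image timestamp so misalignments stand out. The function name, argument names and the synthetic values below are hypothetical, not taken from Batch or BatchML:

```python
from datetime import datetime, timedelta

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_with_absolute_coords(image, x_osgb, y_osgb, image_time,
                              pv_times, pv_power, pv_x, pv_y):
    """Draw one image and one PV time series on absolute space/time axes."""
    fig, (ax_img, ax_ts) = plt.subplots(1, 2, figsize=(10, 4))

    # extent is (left, right, bottom, top) in metres. Assumes row 0 of the
    # image is the northernmost row; accurate to within half a pixel.
    extent = (x_osgb.min(), x_osgb.max(), y_osgb.min(), y_osgb.max())
    ax_img.imshow(image, extent=extent, origin="upper")
    ax_img.scatter([pv_x], [pv_y], marker="x", color="red", label="PV system")
    ax_img.set_xlabel("OSGB easting (m)")
    ax_img.set_ylabel("OSGB northing (m)")
    ax_img.set_title(f"image at {image_time:%Y-%m-%d %H:%M}")
    ax_img.legend()

    # The dashed line marks the image's timestamp on the time series.
    ax_ts.plot(pv_times, pv_power)
    ax_ts.axvline(image_time, linestyle="--", color="grey")
    ax_ts.set_xlabel("datetime")
    ax_ts.set_ylabel("PV power (normalised)")
    fig.autofmt_xdate()
    return fig


# Synthetic values, purely for illustration.
x_osgb = np.linspace(400_000, 432_000, 16)
y_osgb = np.linspace(300_000, 332_000, 16)
image = np.random.default_rng(0).random((16, 16))
start = datetime(2021, 6, 1, 12, 0)
pv_times = [start + timedelta(minutes=5 * i) for i in range(12)]
pv_power = np.linspace(0.8, 0.3, 12)
fig = plot_with_absolute_coords(image, x_osgb, y_osgb, pv_times[6],
                                pv_times, pv_power, pv_x=416_000, pv_y=316_000)
```

If a PV marker falls outside the image extent, or the dashed line sits outside the time series range, that would be an immediate hint that two data sources disagree in space or time.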

peterdudfield commented 2 years ago

Yeah, they are in both.

Perhaps I should start with Batch, as that's at the beginning of the data pipeline, so it would be good to plot that first. Then, as step 2, do BatchML. This way we can check Batch first: if Batch is wrong then BatchML will also be wrong, but it's easier to work out what is wrong from Batch.

peterdudfield commented 2 years ago

I'm going to try to split the PR up into different PRs for different data_sources, in order to keep the PRs small

JackKelly commented 2 years ago

SGTM! Thanks!

jacobbieker commented 2 years ago

Yeah, all sounds good, thanks!

peterdudfield commented 2 years ago

Just to whet your appetite: [images: pv, satellite, gsp]

peterdudfield commented 2 years ago

I'm going to work on some combined 'data_source' plots. What I was thinking:

  1. animation of satellite and pv/gsp
  2. time series of pv and gsp
  3. animation of satellite and nwp next to each other

But your ideas are very welcome.

Just to say, I'm trying to build all these plots up in a modular way, so hopefully it'll be easy to add things together, or add other data_sources
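One possible shape for that modular approach, sketched under assumptions: each data source gets a small function that draws onto a given matplotlib `Axes`, and a combined plot just picks which ones to place side by side. The `PLOTTERS` registry, the function names and the dict-based example layout are all hypothetical, not the code from the PRs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_timeseries(ax, data, name):
    """Draw a 1D data source (e.g. pv, gsp) onto an existing Axes."""
    ax.plot(data["time"], data["values"])
    ax.set_xlabel("time")
    ax.set_title(name)


def plot_image(ax, data, name):
    """Draw a 2D data source (e.g. one satellite or nwp frame) onto an Axes."""
    ax.imshow(data["values"])
    ax.set_title(name)


# Registry: supporting another data source means adding one entry here.
PLOTTERS = {"pv": plot_timeseries, "gsp": plot_timeseries,
            "satellite": plot_image, "nwp": plot_image}


def plot_combined(example, names):
    """Place the requested data sources next to each other in one figure."""
    fig, axes = plt.subplots(1, len(names), squeeze=False,
                             figsize=(4 * len(names), 4))
    for ax, name in zip(axes[0], names):
        PLOTTERS[name](ax, example[name], name)
    return fig


# Synthetic example, purely for illustration.
rng = np.random.default_rng(0)
time = np.arange(12)
example = {
    "pv": {"time": time, "values": rng.random(12)},
    "gsp": {"time": time, "values": rng.random(12)},
    "satellite": {"values": rng.random((16, 16))},
    "nwp": {"values": rng.random((16, 16))},
}
fig_images = plot_combined(example, ["satellite", "nwp"])
fig_series = plot_combined(example, ["pv", "gsp"])
```

Because every plotter takes an `Axes`, the same functions can be reused inside animations or larger grids without change.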

JackKelly commented 2 years ago

Sounds good!

I'd also advocate for plotting real data as much as possible. I appreciate that fake data helps with testing but, given that we're running quite short on time, I think it's important to spend as much time as possible with eyes on "real" data, if that's possible?

jacobbieker commented 2 years ago

All sounds great!

peterdudfield commented 2 years ago

> Sounds good!
>
> I'd also advocate for plotting real data as much as possible. I appreciate that fake data helps with testing but, given that we're running quite short on time, I think it's important to spend as much time as possible with eyes on "real" data, if that's possible?

Yeah, I was hoping we could use the new Satellite and NWP data, but as that is taking a little longer, I should use real data

JackKelly commented 2 years ago

The NWP data should work... Just gotta test it.. Will do that after I've finished with leonardo and caught up on PRs