Closed JackKelly closed 2 years ago
There's some batch data in the tests/data folder, but it might need updating to use all the new variables.
I've moved this issue to "critical" because it'd be really awesome to visualise the contents of each on-disk batch before we start training models (e.g. to check by eye that the different data sources are aligned in space and time, and aren't off by an hour, things like that!)
This could just be a Jupyter Notebook for visualising one batch at a time. Maybe animate the satellite image sequences and NWPs
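A rough sketch of what the animation part might look like, assuming the satellite data for one example can be pulled out as a (time, y, x) array with a matching list of absolute timestamps. The function name `animate_image_sequence`, the random placeholder frames, and the 5-minute spacing are all illustrative assumptions, and matplotlib is just one possible plotting library here, not necessarily the one the notebook would end up using.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation


def animate_image_sequence(frames, timestamps):
    """Animate a (time, y, x) array, showing each frame's absolute timestamp as the title."""
    fig, ax = plt.subplots()
    # Fix the colour scale across frames so brightness changes are comparable.
    image = ax.imshow(frames[0], vmin=frames.min(), vmax=frames.max(), cmap="gray")
    title = ax.set_title(str(timestamps[0]))

    def update(i):
        image.set_data(frames[i])
        title.set_text(str(timestamps[i]))
        return image, title

    return FuncAnimation(fig, update, frames=len(frames), interval=200)


# Placeholder data (NOT real satellite data): 6 random 16x16 frames, 5 minutes apart.
rng = np.random.default_rng(0)
frames = rng.random((6, 16, 16))
timestamps = np.datetime64("2021-06-01T12:00") + np.arange(6) * np.timedelta64(5, "m")

anim = animate_image_sequence(frames, timestamps)
```

In a Jupyter Notebook the result could be displayed with `IPython.display.HTML(anim.to_jshtml())`.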
Trying to think if it's better to plot using Batch or BatchML, or maybe both. Batch is nice because it's the raw data from nowcasting_dataset, but BatchML is what the ML model will see, so that will also be good to plot.
Perhaps a plotting function could take in 'PV' or 'PVML', and produce the plot (or subplot) from there.
Ooh, yeah, good questions... I'm not sure what's best, either... I guess it'll be important for the plots to include the absolute positions in time and space (i.e. the OSGB coordinates for the spatial data, and the absolute datetimes for the timeseries data) so we can check by eye that the different modalities align in space and time... but the locations are in both Batch and BatchML, right?
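As a small complement to checking by eye, the same absolute datetimes could also be compared numerically. This is only a sketch: `time_range` and `check_alignment` are hypothetical helpers, the timestamps are made up, and the 5-minute tolerance is an arbitrary assumption. The second series is deliberately shifted by one hour to show the kind of error being looked for.

```python
from datetime import datetime, timedelta


def time_range(start, n_steps, step):
    """Absolute timestamps for a regularly sampled series."""
    return [start + i * step for i in range(n_steps)]


def check_alignment(times_a, times_b, tolerance=timedelta(minutes=5)):
    """Compare the start times of two modalities and return (offset, is_aligned)."""
    offset = times_b[0] - times_a[0]
    return offset, abs(offset) <= tolerance


# Made-up timestamps for one example; the satellite series starts one hour late on purpose.
pv_times = time_range(datetime(2021, 6, 1, 12, 0), 12, timedelta(minutes=5))
sat_times = time_range(datetime(2021, 6, 1, 13, 0), 12, timedelta(minutes=5))

offset, aligned = check_alignment(pv_times, sat_times)
print(f"satellite - PV offset: {offset}, aligned: {aligned}")
# prints "satellite - PV offset: 1:00:00, aligned: False"
```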
Yeah, they are in both. Perhaps I should start with Batch, as that's at the beginning of the data pipeline, so it would be good to plot that first. Then, as step 2, do BatchML. This way we can check Batch first: if Batch is wrong then BatchML will also be wrong, and it's easier to work out what is wrong in Batch.
I'm going to try to split the PR up into different PRs for different data_sources, in order to keep the PRs small.
SGTM! Thanks!
Yeah, all sounds good, thanks!
Just to whet your appetite:
I'm going to work on some 'data_source' combined plots. What I was thinking
Just to say, I'm trying to build all these plots up in a modular way, so hopefully it'll be easy to add things together, or add other data_sources.
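One way the modular approach might look, as a sketch only: each data source gets its own function that draws onto a given axes, and a small helper lays out whichever panels are requested. The names `plot_pv`, `plot_satellite` and `plot_example`, the placeholder data, and the OSGB extent are all hypothetical, and matplotlib is an assumption rather than the library actually used for these plots.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_pv(ax, times, pv_power):
    """Draw a PV timeseries against absolute datetimes."""
    ax.plot(times, pv_power)
    ax.set_ylabel("PV power")
    ax.set_title("PV")


def plot_satellite(ax, image, extent):
    """Draw one satellite frame; extent is (x_min, x_max, y_min, y_max) in OSGB metres."""
    ax.imshow(image, extent=extent, cmap="gray")
    ax.set_xlabel("OSGB x (m)")
    ax.set_ylabel("OSGB y (m)")
    ax.set_title("Satellite")


def plot_example(panels):
    """Lay out one subplot per (plot_function, kwargs) pair and return the figure."""
    fig, axes = plt.subplots(1, len(panels), figsize=(5 * len(panels), 4), squeeze=False)
    for ax, (plot_fn, kwargs) in zip(axes[0], panels):
        plot_fn(ax, **kwargs)
    fig.tight_layout()
    return fig


# Placeholder data (NOT real): 12 PV readings 5 minutes apart, and one random 16x16 image.
times = np.datetime64("2021-06-01T12:00") + np.arange(12) * np.timedelta64(5, "m")
pv_power = np.sin(np.linspace(0, np.pi, 12))
image = np.random.default_rng(0).random((16, 16))
extent = (500_000, 532_000, 200_000, 232_000)  # made-up OSGB bounds

fig = plot_example([
    (plot_pv, {"times": times, "pv_power": pv_power}),
    (plot_satellite, {"image": image, "extent": extent}),
])
```

Adding another data source would then mean writing one more `plot_*` function and appending it to the list passed to `plot_example`.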
Sounds good!
I'd also advocate for plotting real data as much as possible. I appreciate that fake data helps with testing but, given that we're running quite short on time, I think it's important to spend as much time as possible with eyes on "real" data, if that's possible?
All sounds great!
Yeah, I was hoping we could use the new Satellite and NWP data, but as that is taking a little longer, I should use real data.
The NWP data should work... Just gotta test it.. Will do that after I've finished with leonardo and caught up on PRs
Detailed Description
As @peterdudfield shows in openclimatefix/nowcasting_dataset#178, sometimes the best way to identify bugs in data is to plot the data :)
It would be awesome to have a tool which "simply" plots / animates all the data in each example, so we can visually check that the data looks sensible (e.g. check that PV output decreases as clouds pass overhead, etc.)
Possible Implementation
Maybe this should live in nowcasting_utils?
(BTW, if someone outside of Open Climate Fix would like to tackle this issue then please shout and we can share a few batches of data!)