openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting data
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Need larger geo extent for topology data #638

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

In v15, the geo extent of the topology data doesn't quite cover the full extent of the satellite data:

In this figure, the grey skewed rectangle is the satellite data. The coloured square in the background is the topology data:

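[Image: extent of the satellite data (grey skewed rectangle) overlaid on the topology data (coloured square)]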

The fix is probably as simple as changing the config file to have more pixels of topology data.
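As a rough sketch of how the coverage gap could be checked: take the four corners of the satellite footprint (already expressed in the topology data's coordinate system), compute their axis-aligned bounding box, and test whether the topology extent contains it. The helper names and every coordinate below are hypothetical, chosen only to illustrate the check; they are not values from v15.

```python
from typing import Iterable, NamedTuple, Tuple


class Extent(NamedTuple):
    """Axis-aligned extent in projected metres (hypothetical helper)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float


def bounding_extent(corners: Iterable[Tuple[float, float]]) -> Extent:
    """Smallest axis-aligned extent containing all (x, y) corners."""
    xs, ys = zip(*corners)
    return Extent(min(xs), min(ys), max(xs), max(ys))


def covers(outer: Extent, inner: Extent) -> bool:
    """True if `outer` fully contains `inner`."""
    return (
        outer.x_min <= inner.x_min
        and outer.y_min <= inner.y_min
        and outer.x_max >= inner.x_max
        and outer.y_max >= inner.y_max
    )


# Made-up corners of a skewed satellite footprint, and a made-up topo extent.
satellite_corners = [
    (-100_000, 0),
    (650_000, -50_000),
    (700_000, 1_050_000),
    (-50_000, 1_100_000),
]
topo_extent = Extent(0, 0, 700_000, 1_000_000)

satellite_extent = bounding_extent(satellite_corners)
topo_covers_satellite = covers(topo_extent, satellite_extent)
print(satellite_extent, topo_covers_satellite)
```

Because the satellite footprint is skewed, its bounding box is larger than the footprint itself, which is why the topology extent needs some extra margin on every side.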

peterdudfield commented 2 years ago

May be linked with this - https://github.com/openclimatefix/nowcasting_dataset/issues/610

Obviously the two data sets can't cover exactly the same area, so we just have to decide whether topology > satellite, or satellite > topology

JackKelly commented 2 years ago

Yeah, good question!

From a selfish perspective - thinking only of the power_perceiver model - I would prefer not to have a separate topology dataset on-disk, but instead for nowcasting_dataset to provide surface_height and land/sea mask coordinates in the satellite and NWP on-disk batches (please see issue #642 for more context). (Because I'm using surface_height as part of the position-encoding of each patch of satellite data, rather than providing surface_height as a separate modality).

But I'm not sure if this would work for your models, @peterdudfield and @jacobbieker ?

peterdudfield commented 2 years ago

I can see why that would be convenient - It potentially goes against our principle of 'module' data sources which so far has been useful.

One solution could be for the topological data to contain

JackKelly commented 2 years ago

> It potentially goes against our principle of 'module' data sources which so far has been useful

That's a good point. I don't think it would be too bad to include topological data with NWP and satellite batches because the topo data is so small and fast to process. The ability to produce different data sources in nowcasting_dataset feels most useful for the time-consuming data sources (which might take a few days to generate)

> reprojected data, on OSGB grid. This grid could be the same as NWP and/or satellite so it's easier to merge

Unfortunately, that won't be sufficient. The topo data is already in OSGB projection. To make it possible to align the topo data with the satellite data, the topo data has to be reprojected to geostationary projection.

So, if we wanted to keep producing the topo data as a separate set of files then we'd need:

TBH, this is starting to sound complicated! I really would lean towards storing the topo data as a third coord of the satellite and NWP data. That also feels semantically like a good fit: surface height is "just" the third spatial dimension, so feels like it fits with the x and y coords.
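A minimal sketch of the "third coord" idea, assuming the batches are handled with xarray: surface height is attached as a non-index coordinate over the `y` and `x` dimensions, so it is sliced together with the imagery. The variable name `hrv`, the coordinate name `surface_height`, and the array shapes are hypothetical; this is not the actual nowcasting_dataset batch schema.

```python
import numpy as np
import xarray as xr

# Toy satellite data: 2 timesteps, 4 rows (y), 6 columns (x).
sat = xr.DataArray(
    np.zeros((2, 4, 6), dtype=np.float32),
    dims=("time", "y", "x"),
    coords={"time": [0, 1], "y": np.arange(4), "x": np.arange(6)},
    name="hrv",
)

# Toy surface height on the same y/x grid (i.e. already reprojected).
height = np.arange(24, dtype=np.float32).reshape(4, 6)

# Attach height as a 2D non-index coordinate alongside x and y.
sat = sat.assign_coords(surface_height=(("y", "x"), height))

# Selecting a spatial patch also selects the matching heights.
patch = sat.isel(y=slice(1, 3), x=slice(2, 5))
print(patch.coords["surface_height"].values)
```

With this layout a model that uses height for position encoding can read it from the same object as the pixel values, with no separate topo file to align at load time.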

Or, another option is we just ignore topo data in nowcasting_dataset. At the moment, power_perceiver loads the topo data directly from the intermediate files (which are only something like 100 MBytes, so easy to load into memory). But that puts a lot of complexity into the data loader. That is, power_perceiver can't use the pre-prepared topo batches as they currently stand.

jacobbieker commented 2 years ago

> TBH, this is starting to sound complicated! I really would lean towards storing the topo data as a third coord of the satellite and NWP data. That also feels semantically like a good fit: surface height is "just" the third spatial dimension, so feels like it fits with the x and y coords.

Yeah, I think having the topo data included in the NWP and satellite batches is probably the best way forward. It's a lot simpler than these alternatives, and shouldn't be too difficult to include. And, like you said, the topo data is small and fast to generate, so I don't think it matters as much if it's not a separate data source.

JackKelly commented 2 years ago

Sounds good :slightly_smiling_face:

FWIW, here's the power_perceiver code I wrote on Friday which:

  1. Loads the topo data into memory.
  2. Reprojects the topo data to the same projection as the satellite data (i.e. geostationary projection).
  3. Uses xarray.combine_by_coords (twice) to align the topo data with the satellite data.

It's by no means perfect!

peterdudfield commented 2 years ago

So for v17, I'll keep it simple by just increasing the area. The default is 64, so I'll up it to 128.

JackKelly commented 2 years ago

If we change the image size of the HRV satellite data to something like 256 pixels wide x 128 pixels tall then maybe the topo data should be something like 300 wide x 200 tall? The topo data takes up so little space on disk that we can probably over-shoot :slightly_smiling_face:
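To put rough numbers on "so little space", here is a back-of-the-envelope calculation. It assumes one uncompressed 2D layer of 4-byte (float32) values per example, which is an assumption for illustration rather than the actual on-disk format; the function name is hypothetical.

```python
def topo_bytes_per_example(width_px: int, height_px: int, bytes_per_pixel: int = 4) -> int:
    """Uncompressed size of one 2D topo patch (assumes float32 by default)."""
    return width_px * height_px * bytes_per_pixel


# Sizes discussed in this thread: the old default, the v17 bump, and the over-shoot.
sizes = {
    "64 x 64": topo_bytes_per_example(64, 64),
    "128 x 128": topo_bytes_per_example(128, 128),
    "300 x 200": topo_bytes_per_example(300, 200),
}

for label, n_bytes in sizes.items():
    print(f"{label}: {n_bytes / 1000:.1f} kB per example")
```

Even the 300 x 200 option stays in the hundreds of kilobytes per example under these assumptions, so over-shooting the extent costs very little.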

peterdudfield commented 2 years ago

Cool, OK, I'll move it to 300.