openclimatefix / nowcasting_dataloader

PyTorch Dataloader for working with multi-modal data for nowcasting applications
https://nowcasting-dataloader.readthedocs.io/
MIT License

Add PV/Unified Position encoding #4

Closed jacobbieker closed 2 years ago

jacobbieker commented 3 years ago

Detailed Description

As mentioned in https://github.com/openclimatefix/satflow/issues/101 it would be helpful to have a way of having consistent position encodings for the PV systems/satellite imagery so that the model can associate the PV systems output in time and space with the satellite imagery.

This unified position encoding could also be useful for other modalities, so that the model only needs one set of position encodings but can use it for all the input modalities.

Context

It would help with the joint model and unifying the inputs/outputs. This position encoding can also be used with the https://github.com/openclimatefix/perceiver-pytorch/issues/20 to ensure the queries use the same positional encodings when getting the output

Possible Implementation

One option would be to encode all of this with Fourier features. Alternatively, some other encoding scheme that works across the different modalities could be used.

jacobbieker commented 2 years ago

One option could be to compute the Fourier features for the largest (or most fine-grained) input and subsample them for the other inputs. For example, to encode a PV system's position within the satellite input image, we would take the Fourier features from the satellite image and crop them around the PV location: the PV features would just be the satellite image's time Fourier features, plus whatever spatial features belong to the pixel the PV system sits in. We would then only need to compute the features once, for the finest-grained input, and I think that would work for the rest of them. For data that arrives every 30 minutes, we would just subselect along the time features, etc.
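The idea above could be sketched roughly like this (a minimal sketch, not the repo's actual API; the grid sizes, the `fourier_encode` helper, and the pixel coordinates are all hypothetical):

```python
import torch


def fourier_encode(coords, num_bands=4):
    """Map coordinates in [-1, 1] to sin/cos Fourier features.

    coords: (..., d) tensor; returns (..., d * 2 * num_bands).
    """
    freqs = torch.arange(1, num_bands + 1, dtype=coords.dtype)
    x = coords.unsqueeze(-1) * freqs * torch.pi      # (..., d, num_bands)
    return torch.cat([x.sin(), x.cos()], dim=-1).flatten(-2)


# Encode the full satellite grid once (hypothetical sizes: 12 five-minute
# timesteps over a 64x64 pixel image).
t = torch.linspace(-1, 1, 12)
y = torch.linspace(-1, 1, 64)
x = torch.linspace(-1, 1, 64)
grid = torch.stack(torch.meshgrid(t, y, x, indexing="ij"), dim=-1)
sat_enc = fourier_encode(grid)       # (12, 64, 64, 24)

# A PV system at pixel (row, col) reuses the satellite encoding for that
# pixel: time features plus the features of the pixel it sits in.
row, col = 10, 20
pv_enc = sat_enc[:, row, col]        # (12, 24)

# A half-hourly modality subselects every 6th five-minute timestep.
gsp_enc = sat_enc[::6, row, col]     # (2, 24)
```

Because every modality slices the same tensor, two inputs at the same time and place are guaranteed to get identical encodings.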

JackKelly commented 2 years ago

Sure, that sounds good! And ensures that the position encoding is the same for different modalities at the same position.

BTW, I can't remember if I've mentioned this before, but it would be good for us to experiment with "relative" positions and/or "absolute" positions. (I'm not sure if I'm using the correct terms here! By "relative" I mean things like "top left pixel" and "the second timestep". By "absolute" I mean the absolute real-world position: the latitude and longitude and the actual time-of-day and day-of-year etc.).
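The relative/absolute distinction could look something like this (an illustrative sketch only; the timestamps and normalisation choices are made up for the example):

```python
import numpy as np
import pandas as pd

# Six 5-minutely timesteps (hypothetical example window).
timestamps = pd.date_range("2021-06-01 10:00", periods=6, freq="5min")

# "Relative" position: just the index of each timestep within the example,
# normalised to [0, 1]. "The second timestep" is 0.2 regardless of the date.
relative = np.arange(len(timestamps)) / (len(timestamps) - 1)

# "Absolute" position: sin/cos of the real-world time-of-day and day-of-year,
# so midnight wraps smoothly into the next day and 31 Dec into 1 Jan.
seconds = timestamps.hour * 3600 + timestamps.minute * 60 + timestamps.second
frac_day = seconds / 86400
frac_year = timestamps.dayofyear / 366
absolute = np.stack(
    [
        np.sin(2 * np.pi * frac_day), np.cos(2 * np.pi * frac_day),
        np.sin(2 * np.pi * frac_year), np.cos(2 * np.pi * frac_year),
    ],
    axis=-1,
)  # (6, 4)
```

The same split applies spatially: "top left pixel" is a relative coordinate, while latitude/longitude (suitably normalised) would be the absolute version.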

JackKelly commented 2 years ago

(I've moved this issue to nowcasting_dataloader :slightly_smiling_face: )

One other use-case that it'd be great to support is openclimatefix/nowcasting_dataset#135. Specifically: The NWPs are natively hourly. nowcasting_dataset linearly interpolates the hourly NWPs to 5-minutely before writing batches to disk. This interpolation is probably the wrong thing to do for fully-attentional ML models because the interpolation blows up the size of the input data without adding any information (in an "information-theoretic" sense :slightly_smiling_face:). Instead, I'd like to experiment with feeding Perceiver models with data where each modality is at its native sample rate (so NWPs would be hourly, satellite images would be 5-minutely, some PV would be 15-minutely, some 5-minutely; GSP-level PV would be half-hourly).

(This temporal interpolation is probably required for CNN models... if we continue experimenting with CNN models then we could temporally interpolate in nowcasting_dataloader... but that's a separate issue!)

Once we're feeding hourly NWPs into our models, we could also experiment with, for example, using NWPs which cover a longer time period than the satellite images. e.g. for the 'history' for each example, we could have 4 NWP timesteps (representing the last 4 hours) and 6 satellite timesteps (representing the last half hour). I mention this because this might suggest that it would be more flexible to separately compute the position encoding for each input modality (as long as the encoding is entirely deterministic and stays in sync across modalities!). But I don't want to impose an implementation: Do what ever you think is best! (And feel free to do this as a separate PR if you're already almost done with implementing position encoding!... or not at all, if you think it's a bad idea!)

jacobbieker commented 2 years ago

It seems like the way to go about this is to get the maximum spatial and temporal extent of all the input modalities, create an encoding based on that, and then 'slice' it up for each of the individual modalities. To me, that seems the easiest way to make sure they are all in sync; we then only need to compute the features once and can reuse them for all the modalities in a given example.

Computing the position encoding for each modality separately and keeping them in sync would, I think, work well for absolute position encodings, since each one already takes a 'slice' of the sin/cos datetime features across a whole year and of the position features in lat/lon. That's how it is currently computed in the PR anyway, so I think that's good to go!

For relative position encodings, I think it gets a bit trickier. Either we take the union of all the input modalities in terms of time covered and intervals (so for the 4 NWP and 6 satellite timesteps, the temporal part of the encoding would run from 4 hours into the past at 5-minute intervals), or each modality is encoded separately (the NWP modality has 4 timesteps, satellite has 6) with the relative encoding relative to its own modality, so the model would have to figure out how the modalities relate to each other. I would prefer the first option: it could end up with lots of 'wasted' position encodings, but it ensures the relative encodings are consistent across the modalities.
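The first option (a shared grid over the union of temporal extents, sliced per modality) might be sketched like this, using the 4-NWP/6-satellite example above; the band count and normalisation are arbitrary choices for illustration:

```python
import torch

# Union of temporal extents: 4 hours of history at the finest interval
# (5 minutes) gives 49 timesteps, t = -240, -235, ..., 0 minutes.
minutes = torch.arange(-240, 1, 5, dtype=torch.float32)   # (49,)
t_norm = minutes / 240                                    # [-1, 0]
bands = torch.arange(1, 5, dtype=torch.float32)           # 4 Fourier bands
feats = torch.cat(
    [
        torch.sin(torch.pi * t_norm[:, None] * bands),
        torch.cos(torch.pi * t_norm[:, None] * bands),
    ],
    dim=-1,
)  # (49, 8)

# Each modality slices its own timesteps from the shared grid, so all
# encodings are relative to the same origin. Many rows of `feats` go
# unused ('wasted'), but the modalities stay consistent.
nwp_idx = torch.arange(0, 48, 12)     # hourly: t = -240, -180, -120, -60
sat_idx = torch.arange(49 - 6, 49)    # the last six 5-minute steps
nwp_feats = feats[nwp_idx]            # (4, 8)
sat_feats = feats[sat_idx]            # (6, 8)
```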

The PR is still very much a WIP, so can definitely make changes! I'll update the relative position encoding to more match the first option I wrote here, and for the absolute position, I think it is working the way you outlined.

JackKelly commented 2 years ago

This all sounds great to me! Thank you!