openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting data
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Use smaller dtypes for saved data #61

Open JackKelly opened 3 years ago

JackKelly commented 3 years ago

int16 for both NWP & satellite? Maybe even uint8 (and use 0 for NaN???)

JackKelly commented 3 years ago

Saving the prepared satellite data as int16 instead of float32 reduces the average batch size on disk (after compression) to about 42 MB, from about 52 MB. So not a huge improvement. But it doesn't make the code much more complicated, and it will save a little on storage costs and load times, so prob worth doing.

peterdudfield commented 3 years ago

Might as well - small amount of work, for small win

JackKelly commented 3 years ago

yeah, the only slightly fiddly bit is re-scaling the original float values (which often have arbitrary lower and upper bounds) to fit within, say, the range [0, 1023]. Which requires us to find the min and max values across the entire dataset... which is a little time-consuming! And, of course, we need to invent a convention for representing NaNs in integer world (e.g. -1).

Zarr & Dask should be able to compute mins and maxes from huge datasets in just a few lines of Python, but my limited experience is that Dask often runs out of memory and crashes before returning an answer!
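For what it's worth, the min/max pass does not have to hold the whole dataset in memory. Below is a minimal NumPy-only sketch, assuming the data can be read one chunk at a time; `running_min_max` and `quantise` are hypothetical helper names, not functions in nowcasting_dataset, and the two small arrays stand in for real chunks.

```python
import numpy as np

NAN_SENTINEL = -1  # convention suggested above; any otherwise unused integer would do


def running_min_max(chunks):
    """Return the global (min, max) over an iterable of float arrays, ignoring NaNs."""
    lo, hi = np.inf, -np.inf
    for chunk in chunks:
        finite = chunk[~np.isnan(chunk)]
        if finite.size:
            lo = min(lo, float(finite.min()))
            hi = max(hi, float(finite.max()))
    return lo, hi


def quantise(chunk, lo, hi, n_bits=10):
    """Rescale floats in [lo, hi] to integers in [0, 2**n_bits - 1], stored as int16.

    Assumes hi > lo.
    """
    top = 2**n_bits - 1
    scaled = np.round((chunk - lo) / (hi - lo) * top)
    # Replace NaNs with the sentinel before casting to an integer dtype.
    return np.where(np.isnan(chunk), NAN_SENTINEL, scaled).astype(np.int16)


# Two tiny arrays standing in for chunks read from disk.
chunks = [
    np.array([0.5, np.nan, 2.0], dtype=np.float32),
    np.array([-1.0, 3.0], dtype=np.float32),
]
lo, hi = running_min_max(chunks)
encoded = [quantise(c, lo, hi) for c in chunks]
```

Only one chunk is resident at a time, so memory use is bounded by the chunk size rather than the dataset size. The cost is a second pass over the data to do the actual rescaling.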

JackKelly commented 3 years ago

I should explain the [0, 1023] range... in my limited experiments, it looks like we can reduce the file size on disk by using fewer bits per pixel than the data type provides. e.g. if we use 10 bits per pixel (giving a range of [0, 1023]) within an int16 dtype (by only using the bottom 10 bits of the 16-bit ints) then the file sizes are even smaller... and, IIRC, the EUMETSAT sensors are 10-bit-per-pixel, so there's probably little point storing more than 10 bits per pixel!
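A quick way to see the effect is to compress the same values stored at two bit depths in the same dtype. The sketch below is only illustrative: it uses synthetic random data rather than satellite imagery, and zlib from the standard library as a stand-in for whatever compressor the Zarr store uses, so the exact ratios will differ on real data.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for image pixels: uniform random floats in [0, 1).
pixels = rng.random(100_000)

# Same dtype (int16) and same raw size, but different numbers of bits in use.
full_range = np.round(pixels * (2**15 - 1)).astype(np.int16)  # uses 15 bits
ten_bit = np.round(pixels * (2**10 - 1)).astype(np.int16)  # uses the bottom 10 bits

size_full = len(zlib.compress(full_range.tobytes(), 5))
size_ten = len(zlib.compress(ten_bit.tobytes(), 5))
print(f"15-bit values: {size_full} bytes, 10-bit values: {size_ten} bytes")
```

With 10-bit values the top six bits of every int16 are always zero, and that redundancy is what the compressor exploits.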

peterdudfield commented 2 years ago

Moving back here

Do you mind expanding on what needs doing for the NWP data?

Sure! The short answer is that we want to maintain as much information as possible. I'll illustrate with a slightly cartoonish worst-case example: Let's pretend that NWPs measured irradiance in units of kW per meter squared. If that were true, then the values for irradiance would always be between 0.0 and about 0.7 kW/m2 in the UK. If we just did irradiance.astype(np.int16) then every value would be converted to 0. We'd lose all the information! Instead, we want to "stretch" the irradiance values so they fill int16's full range. int16 can represent values in the range [-32,768, 32,767]. But, for ease, we probably just want to use the positive numbers [0, 32,767]; and use -1 to represent NaNs. So, we need to do something like this:

import numpy as np

# Rescale to [0, 1] (after subtracting the min, nwp.max() is the original range):
nwp -= nwp.min()
nwp /= nwp.max()

# Rescale to [0, 32,767]:
nwp *= 2**15 - 1

# Fill NaNs with -1 (because int16 has no concept of "NaN")
nwp = nwp.fillna(-1)

# Convert to int16
nwp = nwp.astype(np.int16)

But, in order to rescale like this, we need to know the min and max for every NWP variable, across every value in time and space. Then, in nowcasting_dataloader, we'd need to normalise using the standard deviation and mean computed using the rescaled values.
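To check how much information survives the rescaling, a round trip can be sketched as follows. This is a minimal NumPy illustration using the cartoon irradiance range from above. `encode` and `decode` are hypothetical helpers rather than part of nowcasting_dataset or nowcasting_dataloader, and `lo` and `hi` are assumed to be the dataset-wide min and max.

```python
import numpy as np

INT16_MAX = 2**15 - 1
NAN_SENTINEL = -1


def encode(values, lo, hi):
    """Stretch floats in [lo, hi] over [0, 32767]; NaNs become -1."""
    scaled = np.round((values - lo) / (hi - lo) * INT16_MAX)
    return np.where(np.isnan(values), NAN_SENTINEL, scaled).astype(np.int16)


def decode(encoded, lo, hi):
    """Invert encode(), restoring NaN wherever the sentinel appears."""
    values = encoded.astype(np.float32) / INT16_MAX * (hi - lo) + lo
    return np.where(encoded == NAN_SENTINEL, np.nan, values)


irradiance = np.array([0.0, 0.35, np.nan, 0.7], dtype=np.float32)
lo, hi = 0.0, 0.7  # assumed dataset-wide min and max

# A plain cast truncates every value in [0, 0.7] to 0.
naive = np.array([0.0, 0.35, 0.7], dtype=np.float32).astype(np.int16)

restored = decode(encode(irradiance, lo, hi), lo, hi)
```

The round-trip error is bounded by half a quantisation step, (hi - lo) / 32767 / 2, which is around 1e-5 for this example range.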

OK, I think I'll do the satellite int16 first, and then think again about handling the NWP data in the way you have described.

peterdudfield commented 2 years ago

Would be interesting to know if some default compression would achieve the same as doing this rescaling (see above). Probably the only way is to try it out...

JackKelly commented 2 years ago

Here are some benchmarks I did for EUMETSAT data about a year ago.

Which actually seems to say that uint16 is only a tiny bit smaller than float16 (529 MB vs ). So maybe the low-hanging fruit for NWPs is to just do astype(np.float16) (and then we don't have to bother with rescaling).

But if we really want to reduce the space used on disk (and speed up reads) then the trick is to only use, say, 10 bits per channel (i.e. to rescale to [0, 1023], and hence only use the 10 "bottom" bits of int16). Or even 8 bits per channel and use int8.

peterdudfield commented 2 years ago

I was just about to type the same, so low hanging fruit, change it to float16

the test data was

float32 = 13M
float16 = 4.6M
int16 = 3M (but possible information loss)

JackKelly commented 2 years ago

Yeah, exactly. float16 is a bit less precise but it's no biggie.

(Oops! I just spotted that I made a mistake in my last comment! In my tests, float16 was smaller than uint16: 441 MB vs 529 MB, both using zstd clevel=5.)
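To put a rough number on "a bit less precise": float16 keeps an 11-bit significand, so the relative rounding error for normal values is at most about 2**-11 (roughly 0.05%). A small sketch on synthetic values, not real NWP or satellite data:

```python
import numpy as np

# Synthetic values spanning a few orders of magnitude.
values = np.linspace(1.0, 1000.0, 10_000, dtype=np.float32)
as_half = values.astype(np.float16)

rel_error = np.abs(as_half.astype(np.float32) - values) / values
print(f"max relative error: {rel_error.max():.2e}")

# float16 has a limited range too: anything above this overflows to inf.
float16_max = float(np.finfo(np.float16).max)
print(f"largest finite float16: {float16_max}")
```

The range limit is the thing to watch: any variable whose values exceed 65504 would need rescaling before the cast.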

peterdudfield commented 2 years ago

I've pushed that now to PR #335

peterdudfield commented 2 years ago

I'll keep it open, in case we do want to go down the int8/int16 route