openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting models
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Load data into memory if live #660

Closed jacobbieker closed 2 years ago

jacobbieker commented 2 years ago

Pull Request

Description

Loads the Zarr arrays into memory when running live, as otherwise the Zarr store might disappear mid-read while being updated by the 5-minutely process.

Fixes https://github.com/openclimatefix/nowcasting_forecast/issues/68
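In xarray terms, loading into memory is roughly `xr.open_zarr(path).load()`. The idea can be sketched without any libraries: copy every file of the on-disk store into RAM up front, so later reads cannot race with the writer (the helper name and store layout below are illustrative, not from the PR):

```python
import os

def load_store_into_memory(store_path: str) -> dict:
    """Copy every chunk file of an on-disk store into RAM, so that later
    reads cannot hit a store that the 5-minutely writer has replaced."""
    in_memory = {}
    for root, _dirs, files in os.walk(store_path):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, store_path)
            with open(path, "rb") as f:
                in_memory[key] = f.read()
    return in_memory
```

Once the copy is done, the on-disk store can vanish or be rewritten without affecting the reader.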

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

If your changes affect data processing, have you plotted any changes? i.e. have you done a quick sanity check?

Checklist:

codecov-commenter commented 2 years ago

Codecov Report

Merging #660 (1969c6f) into main (8036158) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #660   +/-   ##
=======================================
  Coverage   93.24%   93.25%           
=======================================
  Files          47       47           
  Lines        3197     3200    +3     
=======================================
+ Hits         2981     2984    +3     
  Misses        216      216           
| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| ...et/data_sources/satellite/satellite_data_source.py | 93.78% <100.00%> | (+0.11%) :arrow_up: |


allcontributors[bot] commented 2 years ago

@peterdudfield

I've put up a pull request to add @jacobbieker! :tada:

JackKelly commented 2 years ago

how bullet-proof is this? what happens if the Zarr store on disk disappears while it's being loaded into RAM? :slightly_smiling_face: Or if the chunks on disk change after the metadata has been loaded?

Should we consider saving new Zarrs to different Zarr filenames, and deleting them after, say, 6 hours? So we can guarantee that a Zarr won't disappear while it's being read (assuming it's being read less than 6 hours after the Zarr was created!)
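That unique-filenames-plus-retention scheme could be sketched like this (the prefix, extension, and 6-hour window are illustrative, not from the PR):

```python
import os
import time

def timestamped_filename(prefix: str = "satellite") -> str:
    """Give every new Zarr a unique name, so a reader never sees its
    file replaced underneath it."""
    return f"{prefix}_{time.strftime('%Y%m%dT%H%M%S', time.gmtime())}.zarr.zip"

def delete_old_zarrs(directory: str, max_age_hours: float = 6.0) -> None:
    """Remove Zarrs older than the retention window; anything newer may
    still be mid-read, so it is left alone."""
    cutoff = time.time() - max_age_hours * 3600
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith(".zarr.zip") and os.path.getmtime(path) < cutoff:
            os.remove(path)
```

Readers then open the newest file present, and the writer only ever deletes files old enough that no reader should still hold them.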

jacobbieker commented 2 years ago

> how bullet-proof is this? what happens if the Zarr store on disk disappears while it's being loaded into RAM? :slightly_smiling_face: Or if the chunks on disk change after the metadata has been loaded?
>
> Should we consider saving new Zarrs to different Zarr filenames, and deleting them after, say, 6 hours? So we can guarantee that a Zarr won't disappear while it's being read (assuming it's being read less than 6 hours after the Zarr was created!)

Yeah, there is the issue of race conditions. We've tried to minimize it by writing to a different file and then renaming it once it's finished, which is fairly quick. @peterdudfield and I are planning on making it more robust. Having different filenames does sound like a good alternative!
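On a local filesystem, the write-then-rename pattern looks roughly like this (a sketch; the real pipeline writes to cloud storage, where rename semantics differ, and the helper name is illustrative):

```python
import os

def write_then_rename(data: bytes, final_path: str) -> None:
    """Write to a temporary name, then swap it in with one atomic rename,
    so readers only ever see the old file or the complete new one."""
    tmp_path = final_path + ".partial"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes are on disk before the swap
    os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
```

A reader opening `final_path` at any moment sees either the previous complete file or the new complete file, never a half-written one.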

JackKelly commented 2 years ago

> writing to a different file then renaming after it's finished

Great idea!! If renaming is an atomic operation on AWS, then I would guess that's bullet-proof already? (But I guess that even if renaming a single file is atomic, renaming all the files in a Zarr is unlikely to be atomic?)

Awesome work, BTW! I'm super-impressed at how much stuff you guys are getting done!

jacobbieker commented 2 years ago

> writing to a different file then renaming after it's finished
>
> Great idea!! If renaming is an atomic operation on AWS, then I would guess that's bullet-proof already? (But I guess that even if renaming a single file is atomic, renaming all the files in a Zarr is unlikely to be atomic?)
>
> Awesome work, BTW! I'm super-impressed at how much stuff you guys are getting done!

I don't know if it's atomic; I think it should be, but I'm not sure. But each Zarr is a single file: it's stored as a zip file, like I've been doing for the HF datasets, and that seems to be working quite well.
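Packing the whole store into one archive is what makes the single-rename trick sufficient. A standard-library sketch of producing such a file (zarr can read zip-backed stores; the paths and helper name here are illustrative):

```python
import os
import zipfile

def zip_store(store_dir: str, final_zip: str) -> None:
    """Bundle a multi-file Zarr directory into one zip archive, then
    publish it with a single rename so the swap is all-or-nothing."""
    tmp = final_zip + ".partial"
    with zipfile.ZipFile(tmp, "w") as zf:
        for root, _dirs, files in os.walk(store_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, store_dir))
    os.replace(tmp, final_zip)
```

Because the store is one file rather than thousands of chunk files, there is only one rename to worry about being atomic.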

JackKelly commented 2 years ago

Nice! According to this, moving a single file is atomic on AWS. So I think that means that what you've already built should be bullet-proof. Nice!

If it's not, then another option might be a good old-fashioned LOCKFILE. But let's assume your current implementation is bullet-proof :slightly_smiling_face:
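For completeness, the lockfile fallback could look like this on a local filesystem (a sketch only; `os.open` with `O_CREAT | O_EXCL` fails atomically if the file already exists, which is what makes it usable as a lock, but this does not carry over to object stores like S3):

```python
import os
import time

def acquire_lock(lock_path: str, timeout: float = 10.0) -> None:
    """Spin until we can create the lockfile; creation with
    O_CREAT | O_EXCL is atomic, so only one process can win."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(0.1)

def release_lock(lock_path: str) -> None:
    """Delete the lockfile so the next waiter can proceed."""
    os.remove(lock_path)
```

The writer would hold the lock while replacing the Zarr, and readers would acquire it before opening the store.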