openclimatefix / nwp

Tools for downloading and processing numerical weather predictions
MIT License

Download latest UKV GRIB files onto `leonardo` from CEDA #21

Closed JackKelly closed 1 year ago

devsjc commented 1 year ago

A couple of clarifications, if possible:

  1. What date range of data do we want to cover?
  2. Where on leonardo is most suitable to save this (grib and zarr files)?

JackKelly commented 1 year ago

Sure!

GRIB files go in: /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/NWP/UK_Met_Office/UKV/native/

Please note the directory structure already present in that directory.

This directory already has files up to Feb 2023. So please can we download the files from the end of the current GRIB dataset, to today?

Zarr datasets go in: /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/NWP/UK_Met_Office/UKV/zarr/

@jacobbieker is this correct?

jacobbieker commented 1 year ago

That directory is correct. Apparently the files from Oct/November 2021 to Feb 2023 aren't actually GRIB files: when I tried appending them to the NWP zarr with the script in here, it threw errors, and the contents of the downloaded files aren't right. So I would recommend redownloading from the last date in the NWP Zarr.

The path for the zarr dataset is also correct. There are currently 3 variants of the NWP zarr in /mnt/storage_c/ but they should probably just be moved over for now.

The directory structure in the /native/ folder matches CEDA, as the data was downloaded by wget mostly, so should match CEDA exactly from the root of the CEDA NWP data.

devsjc commented 1 year ago

I see what you're saying about the structure of the gribs - I'm having trouble with the Zarr conversion also. So as not to hold up proceedings, I'll set it downloading just the native files for now whilst I investigate the conversion side of things.

devsjc commented 1 year ago

One thing worth raising is that in the consumer so far I've avoided downloading anything beyond Wholesale2, as the variables in Wholesale3 onwards are not ones that are available in the live data we pull from MetOffice. Do you still want everything pulling from CEDA as before, or just those with parameters we can access in the live datasets?

Also, those dodgy grib files on leonardo are actually HTML files explaining that you aren't logged in to CEDA! No wonder zarr didn't like them!
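
For anyone wanting to spot these before a Zarr conversion, here is a minimal sketch of the check described above. It relies on the fact that a GRIB message begins with the four ASCII bytes `GRIB`; the assumptions are that the indicator appears within the first kilobyte of a genuine file, that saved login pages begin with `<`, and that files end in `.grib`. The function names are made up for illustration and are not part of this repo.

```python
from pathlib import Path

GRIB_MAGIC = b"GRIB"  # every GRIB message starts with these four bytes


def looks_like_grib(path, probe_bytes=1024):
    """Return True if the start of the file looks like GRIB rather than HTML."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # A saved login page starts with markup such as "<!DOCTYPE" or "<html".
    if head.lstrip().startswith(b"<"):
        return False
    return GRIB_MAGIC in head


def find_non_grib(root, pattern="*.grib"):
    """List files under root (assumed extension: .grib) that fail the check."""
    return sorted(p for p in Path(root).rglob(pattern) if not looks_like_grib(p))
```

Running `find_non_grib` on the `native/` directory would give a sorted list of suspect files, which should also reveal where the bad downloads begin.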

jacobbieker commented 1 year ago

I think we would just want the ones that are available during live data, so yeah, I would ignore the wholesale3 ones, right @JackKelly?

And that makes a lot more sense! Thanks for figuring that out

devsjc commented 1 year ago

Should I delete the HTML datasets back to the correct time in 2021 and then redownload from there to now?

jacobbieker commented 1 year ago

Yeah, that would be great

devsjc commented 1 year ago

Ah, I'm not sure I have permission to do that actually: the files seem to be write-protected, and I can neither delete them nor change the permissions to allow it. The first HTML files appear for the 1800 run on 2021/11/08.

jacobbieker commented 1 year ago

Oh okay, I might have to, I can start that

jacobbieker commented 1 year ago

I've started deleting those ones, it should be free soon

jacobbieker commented 1 year ago

Should be done now

devsjc commented 1 year ago

Cheers @jacobbieker!

devsjc commented 1 year ago

@JackKelly you mentioned you wanted to play with the full 52 hour forecasting horizon, by which I assume you mean the 52 time steps in each run. However, in the consumer I'm currently limiting it to downloading e.g. Wholesale1 and not Wholesale1T54, as Wholesale1 contains steps 0-36, which is already more than we pull in live: that data only goes up to 12 steps. I can of course remove these limits and pull the T54 files as well, but I wanted to let you know that models trained using the full forecast horizon will currently be starved of that data when running in production.

All these concessions are made in the name of keeping the live and historic data consistent, as it was my understanding that this was the desirable end goal. However, if it's better to have more available for training so that we can better inform what is pulled from live, that also makes sense to me as a motive. Let me know what you think is best!
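
To make the choice concrete, here is a rough sketch of the filter as a toggle. The filename shape (a name ending in `_Wholesale<N>.grib` or `_Wholesale<N>T54.grib`) is an assumption for illustration, and `want_file` is a hypothetical helper rather than the consumer's actual code.

```python
import re

# Assumed filename ending, e.g. "..._Wholesale1.grib" or "..._Wholesale1T54.grib".
_WHOLESALE = re.compile(r"_Wholesale(?P<num>\d+)(?P<t54>T54)?\.grib$")


def want_file(filename, max_wholesale=2, include_t54=False):
    """Decide whether a file should be downloaded.

    max_wholesale=2 skips Wholesale3 onwards; include_t54 controls whether
    the longer-horizon T54 files are fetched as well.
    """
    m = _WHOLESALE.search(filename)
    if m is None:
        return False
    if int(m.group("num")) > max_wholesale:
        return False
    if m.group("t54") and not include_t54:
        return False
    return True
```

With the defaults this matches the current consumer behaviour described above; passing `include_t54=True` would add the T54 files for R&D use.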

JackKelly commented 1 year ago

although, I should check: is it possible to get 54hr forecasts from UK Met Office in production?!

devsjc commented 1 year ago

> Even so, I'd be very keen to have 54-hour NWPs for the full history for our R&D

I'll make sure that downloads as well then.

> is it possible to get 54hr forecasts from UK Met Office in production?

Yes it is, although we would have to further upgrade our subscription, as that would more than quadruple our data throughput!

devsjc commented 1 year ago

Just discovered I also don't have permission to use docker on leonardo either! Does anyone, or is it just not installed?

JackKelly commented 1 year ago

Ah, oops, sorry - I've just checked on leonardo. It turns out that you had minimal permissions (the only group you were assigned to was the `sol` group!). I've just added you to these groups: `docker`, `sudo`, `ocf`.

(If anyone needs to add users to a Linux machine in the future, please see these step-by-step instructions for how to add users to the correct groups)

JackKelly commented 1 year ago

A quick update (copying relevant info from OCF's internal Slack to this GitHub issue)...

Sol successfully downloaded all the files from CEDA onto leonardo, but the downloaded files only go up to 2022-12-14, even though the CEDA web interface shows files going up to 2023-05-05. FileZilla can also download files up to 2023-05-05 from ftp://ftp.ceda.ac.uk/badc/ukmo-nwp/data/ukv-grib/ (using Jack's CEDA credentials).

Jack will take responsibility for downloading the remaining files using wget.

JackKelly commented 1 year ago

I think I've fixed scripts/download_UK_Met_Office_NWPs_from_CEDA.sh.

I'm running it now on leonardo over the full time range of the UKV dataset, so it'll download any missing files it finds (for example, some were missing for 2021-11-08).

It'll download Wholesale1 and 2 files, including the T54 files.
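
Separately from the shell script, a quick way to see which days still have no files locally could look like the sketch below. It assumes the local tree uses `YYYY/MM/DD` subdirectories (suggested by the CEDA-mirroring layout mentioned earlier, but not verified here) and a `.grib` extension; `days_without_files` is a hypothetical helper, not something in this repo.

```python
from datetime import date, timedelta
from pathlib import Path


def days_without_files(root, start, end, pattern="*.grib"):
    """Return the dates in [start, end] whose YYYY/MM/DD directory has no files.

    The YYYY/MM/DD layout is an assumption about the local directory tree.
    """
    root = Path(root)
    missing = []
    day = start
    while day <= end:
        day_dir = root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
        # A missing directory counts as a missing day.
        if not day_dir.is_dir() or not any(day_dir.glob(pattern)):
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

For example, `days_without_files(native_dir, date(2022, 12, 14), date(2023, 5, 5))` would list the days in the gap reported above that still lack any files.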

JackKelly commented 1 year ago

OK, the script appears to have worked, so we now have data from 2016 to 2023-05-05T18:00