Closed: JackKelly closed this issue 1 year ago
Sure!
GRIB files go in: /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/NWP/UK_Met_Office/UKV/native/
Please note the directory structure already present in that directory.
This directory already has files up to Feb 2023. So please can we download the files from the end of the current GRIB dataset up to today?
Zarr datasets go in: /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/NWP/UK_Met_Office/UKV/zarr/
@jacobbieker is this correct?
That directory is correct. The files from Oct/November 2021 to Feb 2023 apparently aren't actually GRIB files: when I tried appending them to the NWP Zarr with the script in here, it threw errors, so the contents of those downloads aren't right. I'd therefore recommend redownloading from the last date in the NWP Zarr.
The path for the Zarr dataset is also correct. There are currently 3 variants of the NWP Zarr in /mnt/storage_c/, but they should probably just be moved over for now.
The directory structure in the /native/ folder matches CEDA, as the data was mostly downloaded with wget, so it should match CEDA exactly from the root of the CEDA NWP data.
I see what you're saying about the structure of the GRIBs - I'm having trouble with the Zarr conversion too. So as not to hold up proceedings, I'll set it downloading just the native files for now whilst I investigate the conversion side of things.
One thing worth raising: in the consumer so far I've avoided downloading anything beyond Wholesale2, as the variables in Wholesale3 onwards aren't available in the live data we pull from the Met Office. Do you still want everything pulled from CEDA as before, or just the parameters we can access in the live datasets?
Also, those dodgy GRIB files on leonardo are actually HTML files explaining that you aren't logged in to CEDA! No wonder Zarr didn't like them!
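For anyone hitting the same thing: a quick way to spot these impostors is to check the magic number, since real GRIB files (editions 1 and 2) begin with the four ASCII bytes "GRIB" while the failed downloads begin with HTML. A minimal sketch (the `find` path and glob are illustrative, not the real layout):

```shell
# Real GRIB files start with the 4-byte magic "GRIB"; the bad CEDA
# downloads start with an HTML login page instead.
is_grib() {
    [ "$(head -c 4 "$1")" = "GRIB" ]
}

# Illustrative sweep: report every .grib file that isn't really GRIB.
find . -name '*.grib' -type f | while read -r f; do
    is_grib "$f" || echo "not GRIB: $f"
done
```

Any file flagged this way is a candidate for deletion and redownload.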
I think we would just want the ones that are available in the live data, so yeah, I would ignore the Wholesale3 ones, right @JackKelly?
And that makes a lot more sense! Thanks for figuring that out
Should I delete the HTML datasets back to the correct time in 2021 and then redownload from there to now?
Yeah, that would be great
Ah, I'm not sure I have permission to do that actually; it seems they're write-protected, and I can't delete them or change the perms to enable it. The first HTML files appear for the 1800 hours run on 2021/11/08.
Oh okay, I might have to, I can start that
I've started deleting those ones, it should be free soon
Should be done now
Cheers @jacobbieker!
@JackKelly you mentioned you wanted to play with the full 52-hour forecasting horizon, by which I assume you mean the 52 time steps in each run. However, again, currently in the consumer I'm limiting it to downloading e.g. Wholesale1 and not Wholesale1T54, as Wholesale1 contains steps 0-36, which is already more than we pull in live: that data only goes up to 12 steps. I can of course remove these limits and pull the T54 files as well; I just wanted to let you know that models trained on the full forecast horizon will currently be starved of that data when running in production.
All these concessions are made in the name of keeping consistency between the live and historic data, as it was my understanding that that was the desired end goal - however, if instead it's better to have more data available for training, so we can better inform what is pulled from live, that also makes sense to me as a motive. Let me know what you think is best!
Wholesale3: I agree, let's ignore these for now. (I could imagine wanting to use them in the future. But, yeah, let's ignore Wholesale3 for now!)

Wholesale1T54: That's a good point that 54-hour data isn't available in prod (yet). Even so, I'd be very keen to have 54-hour NWPs for the full history for our R&D. One reason is so we can figure out, in R&D, how our models perform out to 54 hours. And the more pressing reason is that our project with the Smith Institute requires a "backtest" of 5 years of forecasts, where each forecast extends to 48 hours. And, if my "medium-complexity national PV forecast" works well, then I'll generate the 5-year backtest using the "medium complexity" model, which can be done entirely in an "R&D setting" (i.e. not in prod). Although, I should check: is it possible to get 54hr forecasts from UK Met Office in production?!
"Even so, I'd be very keen to have 54-hour NWPs for the full history for our R&D"
I'll make sure that downloads as well then.
"is it possible to get 54hr forecasts from UK Met Office in production?"
Yes it is, although we would have to further upgrade our subscription, as that would more than quadruple our data throughput!
Just discovered I also don't have permission to use docker on leonardo either! Does anyone, or is it just not installed?
Ah, oops, sorry - I've just checked on leonardo. It turns out that you had minimal permissions (the only group you were assigned to was the sol group!). I've just added you to these groups: docker, sudo, ocf.
(If anyone needs to add users to a Linux machine in the future, please see these step-by-step instructions for how to add users to the correct groups)
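For the record, the fix amounts to a single command (the group names come from this thread; `usermod` needs root, so it's shown as a comment rather than run here):

```shell
# Run as root / via sudo (group names taken from this thread):
#   usermod -aG docker,sudo,ocf sol
# -a appends to the supplementary group list instead of replacing it;
# forgetting -a would wipe the user's existing group memberships.
# After logging out and back in, the user can confirm their groups with:
id -nG
```

The `-a` flag is the important part: `usermod -G` on its own replaces the whole supplementary-group list.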
A quick update (copying relevant info from OCF's internal Slack to this GitHub issue)...
Sol successfully downloaded all the files from CEDA onto Leonardo, but the downloaded files only go up to 2022-12-14, even though the CEDA web interface shows files going up to 2023-05-05. And FileZilla can download files up to 2023-05-05 (using Jack's CEDA credentials) from ftp://ftp.ceda.ac.uk/badc/ukmo-nwp/data/ukv-grib/
Jack will take responsibility for downloading the remaining files using wget.
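For reference, a resumable mirror of that FTP root looks something like the sketch below. The flags and the `CEDA_USER`/`CEDA_PASS` variables are assumptions for illustration, not the contents of the real scripts/download_UK_Met_Office_NWPs_from_CEDA.sh; the command is printed rather than executed, since it needs CEDA credentials and network access:

```shell
# Sketch of a resumable CEDA mirror. --continue means re-running only
# fetches files that are missing or incomplete, which is what lets the
# script sweep the full time horizon cheaply.
CEDA_ROOT="ftp://ftp.ceda.ac.uk/badc/ukmo-nwp/data/ukv-grib/"
DEST="/mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/NWP/UK_Met_Office/UKV/native/"

# The --accept patterns match Wholesale1/2 and their T54 variants
# (Wholesale1T54 also starts with "Wholesale1").
download_cmd="wget --mirror --continue --no-parent \
  --user=\"\$CEDA_USER\" --password=\"\$CEDA_PASS\" \
  --accept 'Wholesale1*,Wholesale2*' \
  --directory-prefix=\"$DEST\" \"$CEDA_ROOT\""

# Shown rather than run here:
echo "$download_cmd"
```

Because `--mirror` implies timestamping, re-running the same command after an interruption (or a month later) only transfers what has changed or is missing.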
I think I've fixed scripts/download_UK_Met_Office_NWPs_from_CEDA.sh.
I'm running it now on leonardo, over the full time horizon of the UKV dataset, so it'll download any missing files it finds; e.g. there were some missing for 2021-11-08.
It'll download Wholesale1 and 2 files, including the T54 files.
OK, the script appears to have worked, so we now have data from 2016 to 2023-05-05T18:00.
A couple of clarifications, if possible: