raspstephan / nwp-downscale

MIT License

Data download pipeline #6

Open raspstephan opened 3 years ago

raspstephan commented 3 years ago

After the switch to Azure:

The data period is controlled by the available MRMS data. The region is CONUS. We could use Snakemake for the data downloads, as we did for WeatherBench, to catch errors.
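A minimal Snakefile sketch of that idea (the URL and the date list are placeholders, not the real archive layout):

```python
# Snakefile sketch: one job per hourly file, so failed downloads surface as
# individual failed jobs that can be retried, instead of one monolithic
# script dying halfway through.
DATES = ["20200101", "20200102", "20200103"]  # placeholder date list

rule all:
    input:
        expand("raw/mrms/{date}.zip", date=DATES)

rule download:
    output:
        "raw/mrms/{date}.zip"
    shell:
        # placeholder URL; the real MRMS archive layout differs
        "wget -q -O {output} https://example.org/mrms/{wildcards.date}.zip"
```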

raspstephan commented 3 years ago

@HirtM We need to decide which variables to download from TIGGE. You can see the available variables here: https://apps.ecmwf.int/datasets/data/tigge/levtype=sfc/type=cf/ (Also note the different levels on the left). Unfortunately, TIGGE doesn't archive the stratiform and convective precipitation separately. This could have been important information but oh well.

In addition to Total Precipitation, what do you think could be useful? Note the potential storage limitations: https://github.com/raspstephan/nwp-downscale/issues/4#issuecomment-744003517

HirtM commented 3 years ago
  1. Precipitation
  2. Mid-tropospheric (e.g. 500hPa) u, v wind
  3. CAPE
  4. CIN
  5. I think we need some moisture information: Total column water (does this include water vapor, or just liquid water?), or specific humidity at some level, but I think an integrated variable might be better.

Orography (this should be time-independent)
Land-sea mask (also time-independent)

In addition, the following variables may be worthwhile:

Others we might consider, but I don't fully understand how they would contribute helpful information:

Also, do we want to distinguish between snow and rain? This would mean that our output is not just one 2d field, but two. I would assume it's okay to ignore snow for now.

raspstephan commented 3 years ago

Thanks a lot @HirtM! That's a lot of fields. We should probably start with the most important ones to keep data volumes down and avoid making training too slow.

Here is what I would start with based on your considerations:

  1. Precip, of course
  2. U/V at 500hPa (to understand movement)
  3. Integrated water vapor (to estimate precipitation potential even if it doesn't rain in the IFS)
  4. 2m temperature (I bet there are a lot of correlations between precip and t2m. This maybe is "cheating" but we don't care)
  5. CAPE/CIN

I think it's also really important to include orography. But ideally we would feed in the high-resolution orography, so it would need to come in at a different point in the network (see the sketch below). Same for the land-sea mask, even though we will probably start with mostly land points.
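To make "a different point in the network" concrete, here is a minimal sketch (assuming a PyTorch-style model, purely illustrative): the coarse predictors are upsampled first, and the high-resolution orography is concatenated only once the fields are at the target resolution.

```python
import torch
import torch.nn as nn

class Downscaler(nn.Module):
    """Illustrative only: coarse inputs go through the upsampling trunk;
    high-res static fields (orography, land-sea mask) join afterwards."""

    def __init__(self, n_coarse_vars: int, upscale: int = 8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(n_coarse_vars, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=upscale, mode="bilinear"),
        )
        # +1 input channel for the high-res orography injected at full resolution
        self.head = nn.Conv2d(64 + 1, 1, kernel_size=3, padding=1)

    def forward(self, coarse, hires_orog):
        x = self.trunk(coarse)                 # now at target resolution
        x = torch.cat([x, hires_orog], dim=1)  # inject high-res orography
        return self.head(x)
```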

Do you agree with my ranking? We can always download more later.

HirtM commented 3 years ago

> Do you agree with my ranking? We can always download more later.

Yes, I agree! Sounds like a good start.

raspstephan commented 3 years ago

@anna-184702, here are some pointers for downloading the MRMS data.

Here are the links for the data archive: https://mesonet.agron.iastate.edu/archive/

There is a link to the Google Drive files, and there is also a local cache on a server, but it doesn't contain all the data. So far I have only used the cached data. So unfortunately, we will have to use Google Drive, but you mentioned that you have some experience with that.

The data comes in hourly zip files. Each zip file contains all the variables, which is why they are quite big (~4 GB) and take a while to download. Here are all the variables:

[image: screenshot listing all variables contained in an hourly MRMS zip file]

We only need a small subset of all these variables. I would suggest the following but am open to suggestions from you and @HirtM.

`MultiSensor_QPE_06H_Pass1`, `MultiSensor_QPE_06H_Pass2`, `RadarOnly_QPE_06H`, `MultiSensor_QPE_03H_Pass1`, `MultiSensor_QPE_03H_Pass2`, `RadarOnly_QPE_03H`, `RadarQualityIndex`

In fact, we only need the 6h accumulations every 6 hours (0/6/12/18) to match the TIGGE data. However, I would also download the 3h accumulations every 3 hours, just in case we want to use the YOPP data later.

I already wrote some code to download the files, extract them (first the .zip, then each variable inside is a .gz) and save the relevant variables here. Note that there will be some missing files, i.e. some variables are missing at some time steps. The code should already catch that.
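Not the actual script, but a sketch of the extract step under those assumptions (directory layout, filename handling, and the variable list are illustrative):

```python
import gzip
import shutil
import zipfile
from pathlib import Path

VARIABLES = [
    "MultiSensor_QPE_06H_Pass1", "MultiSensor_QPE_06H_Pass2",
    "RadarOnly_QPE_06H", "RadarQualityIndex",
]

def extract_subset(zip_path, out_dir):
    """Pull only the wanted variables out of one hourly MRMS zip."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # Skip everything that isn't one of the wanted variables; this
            # also silently handles variables missing at some time steps.
            if not any(v in member for v in VARIABLES):
                continue
            # Each member is a gzipped grib2 file; decompress while extracting.
            target = out_dir / Path(member).name.replace(".gz", "")
            with zf.open(member) as fin, gzip.open(fin) as gz:
                with open(target, "wb") as fout:
                    shutil.copyfileobj(gz, fout)
```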

One thing I did not do in my code is convert from GRIB to NetCDF, since xarray can already read GRIB. Do you think we should do this right away so we never have to deal with GRIB? We can probably worry about this later when we are thinking about how to get the data into the NN.
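If we do decide to convert, it should be a one-liner per file via xarray's cfgrib engine (the filename is illustrative, and cfgrib needs to be installed):

```python
import xarray as xr

# Read one GRIB2 file with the cfgrib engine and write it back out as NetCDF.
ds = xr.open_dataset("RadarOnly_QPE_06H_20200101-000000.grib2", engine="cfgrib")
ds.to_netcdf("RadarOnly_QPE_06H_20200101-000000.nc")
```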

raspstephan commented 3 years ago

I hacked together a TIGGE download script. It's pretty ugly, but it works for the deterministic run. For the ensemble there is still a bug, and the data are quite large, so I can't download monthly batches for all 50 ensemble members... I will fix it somehow. Next, I will write a regridding script.

The raw TIGGE data for the control run should now be in /datadrive/tigge/raw. @HirtM Feel free to do some analysis :)
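For reference, a TIGGE retrieval through the ECMWF WebAPI boils down to something like this (a sketch, not the actual script; the parameter code, steps, grid, and area would need checking against the TIGGE catalogue):

```python
from ecmwfapi import ECMWFDataServer

server = ECMWFDataServer()
server.retrieve({
    "class": "ti",
    "dataset": "tigge",
    "expver": "prod",
    "origin": "ecmf",           # ECMWF IFS
    "type": "cf",               # control forecast
    "levtype": "sfc",
    "param": "228228",          # total precipitation (check the catalogue)
    "date": "2019-01-01/to/2019-01-31",
    "time": "00:00:00/12:00:00",
    "step": "6/12/18/24",
    "grid": "0.5/0.5",
    "area": "50/-125/25/-65",   # rough CONUS box, N/W/S/E
    "target": "/datadrive/tigge/raw/tp_2019-01.grib",
})
```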

annavaughan commented 3 years ago

@raspstephan thanks for this! I plan to try downloading the MRMS data tomorrow - I have some Colab scripts for downloading from Google Drive which will hopefully work 😁

raspstephan commented 3 years ago

@anna-184702 If you haven't started yet, maybe hold off on the MRMS download. I might have found a private data source that has MRMS data for several years.

raspstephan commented 3 years ago

I also wrote a quick download script for the MRMS data that is cached on the server. It's currently running, so we should have a few months of data soon. "Soon" means a couple of days, because downloading the data takes forever.

annavaughan commented 3 years ago

@raspstephan I've finally got the MRMS data from September 2019 to May 2020 off that Google Drive 🎉. Do we want the 3 h accumulations or only 6 h at this stage? It looks like /datadrive/mrms/raw/MultiSensor_QPE_03H_Pass1 and /datadrive/mrms/raw/MultiSensor_QPE_03H_Pass2 are empty.

raspstephan commented 3 years ago

Wow, amazing! Thanks so much. For now we are only using 6 h, but keeping 3 h probably doesn't hurt. Yes, it seems the multi-sensor data is only available for the most recent few months, so we have to use RadarOnly. Let me know if you need help getting the data over to Azure.

raspstephan commented 3 years ago

HRRR data now seems to be on AWS! https://registry.opendata.aws/noaa-hrrr-pds/

HREF is here: https://data.nssl.noaa.gov/thredds/catalog/FRDD/HREFv2.html
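A sketch of pulling HRRR from the AWS archive anonymously (the bucket name and key layout are my assumptions based on the registry page, so double-check them):

```python
import s3fs

# Anonymous (no-credential) access to the public HRRR bucket on AWS.
fs = s3fs.S3FileSystem(anon=True)
# Bucket/key layout assumed, not verified: one prefix per day and domain.
files = fs.ls("noaa-hrrr-bdp-pds/hrrr.20210101/conus")
fs.get(files[0], "hrrr_sample.grib2")  # download one forecast file locally
```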

annavaughan commented 3 years ago

> Wow, amazing! Thanks so much. For now we are only using 6 h, but keeping 3 h probably doesn't hurt. Yes, it seems the multi-sensor data is only available for the most recent few months, so we have to use RadarOnly. Let me know if you need help getting the data over to Azure.

It seems that the multi-sensor data is available from September 2019 onward in the Google Drive - I'll download as much as I can find. Still looking for archived data further back, too.

@raspstephan the radar-only 1 h accumulations are available back to 2014. So if we need more data, one option would be to generate 6-hourly accumulations from those (see the sketch below). The multi-sensor data seems to be better quality, though.
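A sketch of that derivation with xarray (assuming the hourly files are already converted and concatenate along a time dimension, and that each field is the accumulation over the preceding hour; paths are illustrative):

```python
import xarray as xr

# Sum hourly accumulations into 6 h windows ending at 00/06/12/18 UTC.
hourly = xr.open_mfdataset("RadarOnly_QPE_01H/*.nc", combine="by_coords")
six_hourly = hourly.resample(time="6H", label="right", closed="right").sum()
six_hourly.to_netcdf("RadarOnly_QPE_06H_derived.nc")
```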

annavaughan commented 3 years ago

The new data is available at /datadrive/mrms/raw_drive

raspstephan commented 3 years ago

HRRR data is now downloading and being regridded. It should be done for a year of data in a day or two.

@HirtM One thing to be aware of is that the HRRR domain is a little smaller than the radar coverage. You can see it at the bottom of the figure below, where the precipitation is cut off. Currently, these values are simply zero. We just need to make sure to consider this for evaluation.

[image: precipitation map; the field is cut off at the bottom where the HRRR domain ends inside the radar coverage]
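One way to account for this at evaluation time would be a static validity mask (a sketch; the file and variable names are made up):

```python
import numpy as np
import xarray as xr

# Points that are zero at every time step are treated as outside the HRRR
# domain; that heuristic and the names below are assumptions, not repo code.
hrrr = xr.open_dataset("/datadrive/hrrr_regridded.nc")
inside_domain = (hrrr["tp"] != 0).any("time")

def masked_rmse(forecast, observed, mask):
    """RMSE over in-domain points only."""
    err2 = ((forecast - observed) ** 2).where(mask)
    return float(np.sqrt(err2.mean()))
```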

raspstephan commented 3 years ago

Now that we have more MRMS data (yay!), let's use 2018/19 for training and 2020 for validation. I am currently downloading all the required data. This will take a week at least though. In the meantime, we can just keep using the few months of data we already have.

annavaughan commented 3 years ago

> Now that we have more MRMS data (yay!), let's use 2018/19 for training and 2020 for validation. I am currently downloading all the required data. This will take a week at least though. In the meantime, we can just keep using the few months of data we already have.

Thanks @raspstephan!

raspstephan commented 3 years ago

@annavaughan @HirtM I just saw that we've run out of disk space on the CPU machine. I will add more storage tomorrow.

raspstephan commented 3 years ago

> @annavaughan @HirtM I just saw that we've run out of disk space on the CPU machine. I will add more storage tomorrow.

We now have 3 TB on the CPU machine, which should be enough for a while.