howweirdistheweather / weather_app


download recent data from CDS #2

Closed mbjones closed 11 months ago

mbjones commented 1 year ago

Request from @jamcinnes2 to download recent data:

I pushed some changes to our weather_app repo that will allow our tools to periodically get the latest cds era5 data. Could you pull the latest code and run cdstool.py to catch us up and download what we are missing?

I think starting it from 2021 will be adequate. Eventually I see cdstool.py and tiletool.py being called from a shell script that is run weekly.

You can run cdstool.py from the same data folder as before and it will download to the ./cds_era5/ directory. Note there are some new requirements for the python env; these are listed in requirements.txt. (Also, the tool now downloads 1 day of data at a time. tiletool.py is able to read the old yearly downloads we have, as well as the new daily downloads.) Anyway, this would do it: python3 cdstool.py --startyear 2021 --forcedownload
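The weekly shell script mentioned above might look something like this (a sketch only; the script name, data directory, and crontab schedule are assumptions, and whether --forcedownload is needed for routine catch-up runs should be checked):

```shell
#!/bin/sh
# weekly_update.sh -- hypothetical weekly wrapper for cdstool.py + tiletool.py.
run_weekly() {
    data_dir="${1:-/var/data/hwitw/input}"
    cd "$data_dir" || return 1
    # catch up on any missing daily downloads since 2021
    python3 cdstool.py --startyear 2021 || return 1
    # then rebuild the tiles from the downloaded data
    python3 tiletool.py
}

# Example crontab entry: run every Monday at 02:00
# 0 2 * * 1 /bin/sh /path/to/weekly_update.sh
```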

mbjones commented 1 year ago

Ran cdstool, got recent data through day 33 of 2023. Data are now in a different format than prior years (daily files instead of yearly summaries), so we need to figure out the discrepancy. I think there is some duplicated data too.

mbjones commented 1 year ago

Started cdstool again to update 2023; the download is in progress.

mbjones commented 1 year ago

Downloads through 2023 day 241 completed. The downloads are all producing a warning message of the form:

InsecureRequestWarning: Unverified HTTPS request is being made to host 'cds.climate.copernicus.eu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings

Investigate and switch to secure https addresses if possible.

mbjones commented 1 year ago

Error generated when starting total_precipitation:

daily input global-2022-*-total_precipitation.nc could not be opened. variable tp : dimensions mismatch between master /var/data/hwitw/input/cds_era5/2022/total_precipitation/global-2022-001-total_precipitation.nc (('time', 'latitude', 'longitude')) and extension /var/data/hwitw/input/cds_era5/2022/total_precipitation/global-2022-335-total_precipitation.nc (('time', 'expver', 'latitude', 'longitude')) Trying yearly.
global-2022-total_precipitation.nc could not be opened! [Errno 2] No such file or directory: '/var/data/hwitw/input/cds_era5/2022/global-2022-total_precipitation.nc'

And a related config error in the k8s mount for the cdstool job. Maybe I am not mounting the API key in the same place in both jobs. TBD.

  Warning  FailedMount  36m (x3 over 36m)  kubelet  MountVolume.SetUp failed for volume "config" : object "hwitw"/"hwitw-cdsapirc-secret" not registered
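That FailedMount event typically means the Secret doesn't exist in the namespace or isn't yet visible to the kubelet. For reference, the usual shape of the mount in the job's pod spec would be something like the following (a sketch; only the secret name comes from the error above, and the container name and mount path are assumptions):

```yaml
spec:
  containers:
    - name: cdstool
      volumeMounts:
        - name: config
          mountPath: /root/.cdsapirc   # where cdsapi looks by default
          subPath: .cdsapirc
          readOnly: true
  volumes:
    - name: config
      secret:
        secretName: hwitw-cdsapirc-secret
```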
jamcinnes2 commented 1 year ago

Regarding this

InsecureRequestWarning: Unverified HTTPS request is being made to host 'cds.climate.copernicus.eu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings

Investigate and switch to secure https addresses if possible.

I just ran cdstool on my computer and I did not get that warning. I also saw that all my connections to the Copernicus servers were https. Perhaps you need to upgrade the cdsapi package (pip install --upgrade cdsapi)?

Or maybe your OS needs a root cert update?

jamcinnes2 commented 1 year ago

Error generated when starting total_precipitation:

daily input global-2022-*-total_precipitation.nc could not be opened. variable tp : dimensions mismatch between master /var/data/hwitw/input/cds_era5/2022/total_precipitation/global-2022-001-total_precipitation.nc (('time', 'latitude', 'longitude')) and extension /var/data/hwitw/input/cds_era5/2022/total_precipitation/global-2022-335-total_precipitation.nc (('time', 'expver', 'latitude', 'longitude')) Trying yearly.
global-2022-total_precipitation.nc could not be opened! [Errno 2] No such file or directory: '/var/data/hwitw/input/cds_era5/2022/global-2022-total_precipitation.nc'

And a related config error in the k8s mount for the cdstool job. Maybe I am not mounting the API key in the same place in both jobs. TBD.


  Warning  FailedMount  36m (x3 over 36m)  kubelet  MountVolume.SetUp failed for volume "config" : object "hwitw"/"hwitw-cdsapirc-secret" not registered

That 'expver' data column showed up again somehow. We should be processing that out. I will check it on that global-2022-335-total_precipitation.nc
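For what it's worth, collapsing that dimension could be sketched with xarray along these lines (the merge policy here -- prefer the first expver slice and fill gaps from the others -- is an assumption; the real ERA5 vs ERA5T semantics should be checked first):

```python
import numpy as np
import xarray as xr

def drop_expver(ds: xr.Dataset) -> xr.Dataset:
    # Collapse the ERA5T 'expver' dimension by preferring the first
    # slice and filling missing values from the remaining slices.
    if "expver" not in ds.dims:
        return ds
    out = ds.isel(expver=0, drop=True)
    for i in range(1, ds.sizes["expver"]):
        out = out.combine_first(ds.isel(expver=i, drop=True))
    return out

# Tiny synthetic dataset shaped like the mismatched file above:
tp = xr.DataArray(
    [[[[1.0]], [[np.nan]]]],  # (time=1, expver=2, latitude=1, longitude=1)
    dims=("time", "expver", "latitude", "longitude"),
)
clean = drop_expver(xr.Dataset({"tp": tp}))
```

Running that on a daily file before tiletool.py reads it would give every file the same ('time', 'latitude', 'longitude') layout.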

mbjones commented 12 months ago

I checked the python lib versions installed in the docker image -- the cdsapi package is version 0.6.1, which appears to be the latest from the github repo. Because I rebuild the python environment from scratch from the requirements.txt files just before deployment, we get the most recently released libraries from PyPI. Maybe some of those versions are too recent? If so, we could adjust the requirements.txt and hwitw_requirements.txt files to pin the specific package versions we need, rather than taking the latest available. Below is the list of installed packages in the current docker image. Also note I am still using python 3.9.1, which we were using a while ago; I could switch that up to the latest 3.11 release if you think that would help.

bcrypt==4.0.1
blinker==1.6.2
cdsapi==0.6.1
certifi==2023.7.22
cffi==1.15.1
cftime==1.6.2
charset-normalizer==3.2.0
click==8.1.7
cloudpickle==2.2.1
cryptography==41.0.3
dask==2023.9.0
dill==0.3.7
Flask==2.3.3
Flask-Cors==4.0.0
fsspec==2023.9.0
globus-sdk==3.28.0
gunicorn==21.2.0
h5py==3.9.0
idna==3.4
importlib-metadata==6.8.0
itsdangerous==2.1.2
Jinja2==3.1.2
locket==1.0.0
MarkupSafe==2.1.3
netCDF4==1.6.4
numpy==1.25.2
packaging==23.1
pandas==2.1.0
paramiko==3.3.1
parsl==2023.8.28
partd==1.4.0
psutil==5.9.5
pycparser==2.21
PyJWT==2.8.0
PyNaCl==1.5.0
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
pyzmq==25.1.1
requests==2.31.0
setproctitle==1.3.2
six==1.16.0
tblib==2.0.0
toolz==0.12.0
tqdm==4.66.1
typeguard==2.13.3
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
Werkzeug==2.3.7
xarray==2023.8.0
zipp==3.16.2
mbjones commented 12 months ago

@jamcinnes2 I removed the files you listed in your email from the input directory, and restarted the cdstool job to get them again. It's in progress. Here are the files I removed:

/data/HWITW/input/cds_era5/2022/total_precipitation/global-2022-335-total_precipitation.nc
/data/HWITW/input/cds_era5/2022/cloud_base_height/global-2022-335-cloud_base_height.nc
/data/HWITW/input/cds_era5/2022/precipitation_type/global-2022-335-precipitation_type.nc
/data/HWITW/input/cds_era5/2023/total_precipitation/global-2023-182-total_precipitation.nc
/data/HWITW/input/cds_era5/2023/cloud_base_height/global-2023-182-cloud_base_height.nc
/data/HWITW/input/cds_era5/2023/precipitation_type/global-2023-182-precipitation_type.nc
mbjones commented 12 months ago

After fixing a few permissions problems (because I had made early years read-only to avoid accidentally deleting files), I restarted cdstool from 2021 onward. It is downloading a lot from 2021, which makes me wonder if I need to go back and rerun it all the way from 1950. @jamcinnes2 do you think that is worthwhile?

mbjones commented 12 months ago

@jamcinnes2 ok, the download from 2021 to now finished, but tiletool.py dies when it starts processing the 2021 data. Looks like a data structure problem. Here's the end of the log:

download_dataset finished.
cleaning dataset cds_era5...
done.
** HWITW data processing tool v0.9.3 **

debug: start_week 41 num_weeks 52 total_num_hours 8736
Output hwglobal-temperature_and_humidity-2021.nc
Traceback (most recent call last):
  File "/cdstotile/tiletool.py", line 614, in <module>
    main()
  File "/cdstotile/tiletool.py", line 602, in main
    load_netcdfs( flag_args, input_path, output_path, start_year, end_year )
  File "/cdstotile/tiletool.py", line 410, in load_netcdfs
    process_data_group( flag_args, inp_path, out_path, dir_name, year, dg_name, dg );
  File "/cdstotile/tiletool.py", line 309, in process_data_group
    wk_var[week_i] = week_i + 1
  File "src/netCDF4/_netCDF4.pyx", line 5505, in netCDF4._netCDF4.Variable.__setitem__
  File "src/netCDF4/_netCDF4.pyx", line 5788, in netCDF4._netCDF4.Variable._put
  File "src/netCDF4/_netCDF4.pyx", line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Index exceeds dimension bound
jamcinnes2 commented 12 months ago

@mbjones ok, I just pushed a change to tiletool.py that should fix this one:

  File "/cdstotile/tiletool.py", line 309, in process_data_group
    wk_var[week_i] = week_i + 1

-johnm

mbjones commented 11 months ago

Thanks, @jamcinnes2. After fixing a quick syntax error in tiletool.py, I restarted the job; it grabbed some more data, but then tiletool failed again with a similar error:

** HWITW Copernicus data download tool v1.0 **

download_dataset finished.
cleaning dataset cds_era5...
done.
** HWITW data processing tool v0.9.3 **

debug: start_week 41 num_weeks 52 total_num_hours 8736
Output hwglobal-temperature_and_humidity-2021.nc
Traceback (most recent call last):
  File "/cdstotile/tiletool.py", line 617, in <module>
    main()
  File "/cdstotile/tiletool.py", line 605, in main
    load_netcdfs( flag_args, input_path, output_path, start_year, end_year )
  File "/cdstotile/tiletool.py", line 413, in load_netcdfs
    process_data_group( flag_args, inp_path, out_path, dir_name, year, dg_name, dg );
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cdstotile/tiletool.py", line 310, in process_data_group
    wk_var.append(week_i + 1)
    ^^^^^^^^^^^^^
  File "src/netCDF4/_netCDF4.pyx", line 4909, in netCDF4._netCDF4.Variable.__getattr__
  File "src/netCDF4/_netCDF4.pyx", line 4631, in netCDF4._netCDF4.Variable.getncattr
  File "src/netCDF4/_netCDF4.pyx", line 1545, in netCDF4._netCDF4._get_att
  File "src/netCDF4/_netCDF4.pyx", line 2029, in netCDF4._netCDF4._ensure_nc_success
AttributeError: NetCDF: Attribute not found

Seems like a datafile structure problem, so I removed the temperature input files for 2021 days 140 and 141 to grab new copies, and sure enough they are different:

c91ba2bdc4987bc86ae5d0605f061ea23fa6eafc  global-2021-140-2m_temperature.nc
1d3d41a7dfb5b206cf04a7cb9f5cdb9497fe0f2c  global-2021-140-2m_temperature.nc

This makes me wonder if the format of the netcdf files changed and we need to go through and delete our current data copy and download everything again. Or at least reprocess/reformat them? Do you have a sense of that? It was a lot of time to download originally, and I'd hate to repeat it if we don't have to, so your advice would be appreciated. Is there a quick way to tell if each of the input files we have is in the right format and has the right contents?
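A full contents check means opening every file, but a quick first pass over the tree could at least flag files whose container format differs, since NetCDF classic and netCDF-4/HDF5 files begin with different magic bytes (a stdlib sketch; the label strings are made up, and this checks only the container format, not the variables inside):

```python
import pathlib

# netCDF-4 files are HDF5 containers with an 8-byte signature;
# NetCDF classic files begin with b"CDF".
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def classify_nc(path: pathlib.Path) -> str:
    head = path.open("rb").read(8)
    if head.startswith(b"CDF"):
        return "netcdf-classic"
    if head == HDF5_MAGIC:
        return "netcdf4/hdf5"
    return "unknown"

def scan(root: str) -> dict:
    # Map every *.nc file under root to its detected container format.
    return {str(p): classify_nc(p) for p in sorted(pathlib.Path(root).rglob("*.nc"))}
```

Files landing in different buckets (or "unknown") would be the first candidates for re-download.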

To test if more recent files work, I tried setting START_YEAR=2022 to see if it works with more recently downloaded data, and from that I get a very different error:

** HWITW data processing tool v0.9.3 **

debug: start_week 52 num_weeks 52 total_num_hours 8736
Output hwglobal-temperature_and_humidity-2022.nc
Traceback (most recent call last):
  File "/cdstotile/tiletool.py", line 617, in <module>
    main()
  File "/cdstotile/tiletool.py", line 605, in main
    load_netcdfs( flag_args, input_path, output_path, start_year, end_year )
  File "/cdstotile/tiletool.py", line 413, in load_netcdfs
    process_data_group( flag_args, inp_path, out_path, dir_name, year, dg_name, dg );
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cdstotile/tiletool.py", line 309, in process_data_group
    if wk_var.size-1 < week_i:
       ^^^^^^
UnboundLocalError: cannot access local variable 'wk_var' where it is not associated with a value

That one seemingly got all the way to week 52 of 2022, so my interpretation is that it got through the daily files, but then hit a problem at the end?
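The UnboundLocalError itself just means wk_var is only assigned inside a branch that didn't run on this pass; a minimal guard pattern could look like this (a reconstruction using the traceback's names -- the real tiletool.py flow is more involved and is an assumption here):

```python
# A local assigned only inside a conditional raises UnboundLocalError
# when the branch is skipped; initializing the name up front avoids it.
def process_weeks(needs_var: bool, week_i: int):
    wk_var = None  # always bound, even if the branch below is skipped
    if needs_var:
        wk_var = []
    # guard every later use on the variable actually having been created
    if wk_var is not None and len(wk_var) - 1 < week_i:
        wk_var.extend(range(len(wk_var), week_i + 1))
    return wk_var
```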

Let me know how you'd like me to proceed. Note I pushed some code changes to github.

mbjones commented 11 months ago

Regarding the InsecureRequestWarning, I traced this down to a configuration problem in my .cdsapirc config file, where I had mistakenly set verify: 0. By changing that to verify: 1, I no longer get any https validation warnings.
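For anyone hitting the same warning, a .cdsapirc with verification enabled looks something like this (the key value is a placeholder; the url shown is the CDS API endpoint in use at the time):

```
url: https://cds.climate.copernicus.eu/api/v2
key: <UID>:<API-KEY>
verify: 1
```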

Regarding the other download issues, as far as I can tell we are now getting all of the data in the correct format, except one file in 2023 that has an issue as described in #8. So I am closing this download issue, as I think the rest of the download is working as expected.