google / weather-tools

Tools to make weather data accessible and useful.
https://weather-tools.readthedocs.io/
Apache License 2.0

weather-mv bq raster issue while reading ecmwf grib file #347

Closed heyanand closed 1 year ago

heyanand commented 1 year ago

weather-mv is unable to read ECMWF GRIB files and fails with the raster errors below. Here is the command trace:

`(weather-tools) jupyter@gi-asset:~/weather-tools$ python weather_mv/weather-mv bigquery --uris "gs://gi_asset-ecmwf-ensemble-data/hres-sample/single-levelshres.gb" --output_table "$PROJECT.ecmwf.hres-sample-check" --temp_location "gs://gi_asset-ecmwf-ensemble-data/tmp/" --runner DataflowRunner --num_workers 2 --project $PROJECT --region us-central1 --job_name hrse-sample-mv-003 --disk_size_gb 200
no previously-included directories found matching 'test_data'
INFO:loader_pipeline.bq:Validating regions for data migration. This might take a few seconds...
INFO:loader_pipeline.bq:Region validation completed successfully.
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the file information of the input
INFO:apache_beam.io.gcp.gcsio:Finished listing 1 files in 0.05787062644958496 seconds.
WARNING:loader_pipeline.sinks:Assuming grib.
INFO:loader_pipeline.sinks:Normalizing the grib schema, name of the data variables will look like '<attrs['GRIBstepType']>'.
ERROR:loader_pipeline.sinks:Unable to open file 'gs://gi_asset-ecmwf-ensemble-data/hres-sample/single-levels_hres_2020-01-01T00_00_00z-u100-v100-u10-v10-u200-v200-2t-2d-ssr-str-sp-msl-tprate-ptype-blh-sr-tp.gb': only size-1 arrays can be converted to Python scalars
Traceback (most recent call last):
  File "weather_mv/weather-mv", line 74, in <module>
    cli(['--extra_package', pkg_archive])
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/__init__.py", line 23, in cli
    pipeline(*run(sys.argv + extra))
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/pipeline.py", line 71, in pipeline
    paths | "MoveToBigQuery" >> ToBigQuery.from_kwargs(**vars(known_args))
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 55, in from_kwargs
    return cls(**{k: v for k, v, in kwargs.items() if k in fields})
  File "<string>", line 17, in __init__
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/bq.py", line 154, in __post_init__
    with open_dataset(self.first_uri, self.xarray_open_dataset_kwargs,
  File "/opt/conda/envs/weather-tools/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 385, in open_dataset
    xr_dataset: xr.Dataset = open_dataset_file(local_path,
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 311, in open_dataset_file
    return _add_is_normalized_attr(__normalize_grib_dataset(filename), True)
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 229, in __normalize_grib_dataset
    forecast_hour = int(da.step.values / np.timedelta64(1, 'h'))
TypeError: only size-1 arrays can be converted to Python scalars

(weather-tools) jupyter@gi-python:~/weather-tools$ python weather_mv/weather-mv bigquery --uris "gs://gi_asset-ecmwf-ensemble-data/hres-sample/2020-01-01T00:00:00z-u100-v100-u10-v10-u200-v200-2t-2d-ssr-str-sp-msl-tprate-ptype-blh-sr-tp.gb" --output_table "megatron-389205.ecmwf.hres-sample-check" --temp_location "gs://gi_asset-ecmwf-ensemble-data/tmp/loadprocess/" --runner DataflowRunner --num_workers 2 --project megatron-389205 --region us-central1 --job_name hrse-sample-mv-003 --disk_size_gb 200 --disable_grib_schema_normalization
no previously-included directories found matching 'test_data'
INFO:loader_pipeline.bq:Validating regions for data migration. This might take a few seconds...
INFO:loader_pipeline.bq:Region validation completed successfully.
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
WARNING:loader_pipeline.sinks:Assuming grib edition 1.
INFO:rasterio._env:GDAL signalled an error: err_no=4, msg='/var/tmp/tmphj2fbdcj is a grib file, but no raster dataset was successfully identified.'
ERROR:loader_pipeline.sinks:Unable to open file 'gs://gi_asset-ecmwf-ensemble-data/hres-sample/2020-01-01T00:00:00z-u100-v100-u10-v10-u200-v200-2t-2d-ssr-str-sp-msl-tprate-ptype-blh-sr-tp.gb': /var/tmp/tmphj2fbdcj is a grib file, but no raster dataset was successfully identified.
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 302, in rasterio._base.DatasetBase.__init__
  File "rasterio/_base.pyx", line 213, in rasterio._base.open_dataset
  File "rasterio/_err.pyx", line 217, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: /var/tmp/tmphj2fbdcj is a grib file, but no raster dataset was successfully identified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "weather_mv/weather-mv", line 74, in <module>
    cli(['--extra_package', pkg_archive])
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/__init__.py", line 23, in cli
    pipeline(*run(sys.argv + extra))
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/pipeline.py", line 71, in pipeline
    paths | "MoveToBigQuery" >> ToBigQuery.from_kwargs(**vars(known_args))
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 55, in from_kwargs
    return cls(**{k: v for k, v, in kwargs.items() if k in fields})
  File "<string>", line 17, in __init__
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/bq.py", line 154, in __post_init__
    with open_dataset(self.first_uri, self.xarray_open_dataset_kwargs,
  File "/opt/conda/envs/weather-tools/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/jupyter/weather-tools/weather_mv/loader_pipeline/sinks.py", line 399, in open_dataset
    with rasterio.open(local_path, 'r') as f:
  File "/opt/conda/envs/weather-tools/lib/python3.8/site-packages/rasterio/env.py", line 442, in wrapper
    return f(*args, **kwds)
  File "/opt/conda/envs/weather-tools/lib/python3.8/site-packages/rasterio/__init__.py", line 277, in open
    dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 304, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: /var/tmp/tmphj2fbdcj is a grib file, but no raster dataset was successfully identified.

`

The sample file can be read locally without any issues by using cfgrib to open the datasets. Please help troubleshoot and fix this issue with weather-mv.

alxmrs commented 1 year ago

Hey Hemanand! I think what's going on here is that we're trying to use rasterio (and thus GDAL) to parse the CRS information for the data. Searching around, it seems like GDAL may not support the type of grid that these files are using.

Can you run grib_ls and paste it here?

I think the best path forward to fix this issue for weather-mv bq would be to catch the rasterio error and set default projection information and then return the dataset. Projection info is not relevant for this data sink anyway.
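The fallback described above might look roughly like the sketch below. It is only an illustration, not the actual weather-mv change: the helper name `crs_or_default`, the `opener` parameter, and the `EPSG:4326` default are all assumptions made for this example. It catches `rasterio.errors.RasterioIOError` (the exception in the second traceback) and returns a default CRS string instead of failing.

```python
import logging
from typing import Callable, Optional

try:
    import rasterio
    from rasterio.errors import RasterioIOError
except ImportError:
    # Stand-in so this sketch still imports where rasterio is not installed.
    rasterio = None

    class RasterioIOError(OSError):
        """Placeholder for rasterio.errors.RasterioIOError."""

logger = logging.getLogger(__name__)

# Assumption: plain latitude/longitude is an acceptable default projection.
DEFAULT_CRS = "EPSG:4326"


def crs_or_default(local_path: str,
                   opener: Optional[Callable] = None,
                   default: str = DEFAULT_CRS) -> str:
    """Return the file's CRS as a string, or `default` if it cannot be read.

    `opener` is a hypothetical hook (defaults to rasterio.open) so the
    fallback can be exercised without a real GRIB file.
    """
    if opener is None:
        if rasterio is None:
            return default
        opener = rasterio.open
    try:
        with opener(local_path, "r") as f:
            # f.crs can be None when the driver finds no projection info.
            return f.crs.to_string() if f.crs else default
    except RasterioIOError as e:
        # GDAL could not identify a raster dataset; fall back, don't abort.
        logger.warning("Cannot read CRS from %r (%s); using %s.",
                       local_path, e, default)
        return default
```

Since the BigQuery sink does not need projection info, the caller could attach the returned value to the dataset attributes and continue with the load.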

heyanand commented 1 year ago

Hey Alex, I've attached the file here.

grib_ls.pdf