pySTEPS / pysteps

Python framework for short-term ensemble prediction systems.
https://pysteps.github.io/

pysteps.io with xarray #219

Closed · dnerini closed this 3 years ago

dnerini commented 3 years ago

Include the new xarray-based data model in the pysteps.io module (see #12).

The imported data are converted into an xarray Dataset by means of a decorator.
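For illustration, a minimal sketch of what such a decorator could look like (the metadata keys `x1`/`x2`/`y1`/`y2` and the coordinate construction are assumptions based on the v1 importer convention, not the actual code of this PR):

```python
import functools

import numpy as np
import xarray as xr


def to_xarray_dataset(importer):
    """Wrap a legacy importer returning (precip, quality, metadata)."""

    @functools.wraps(importer)
    def wrapper(*args, **kwargs):
        precip, _, metadata = importer(*args, **kwargs)
        ny, nx = precip.shape
        # Build 1-D coordinates from the grid extent stored in the metadata.
        x = np.linspace(metadata["x1"], metadata["x2"], nx)
        y = np.linspace(metadata["y1"], metadata["y2"], ny)
        return xr.Dataset(
            data_vars={"precipitation": (("y", "x"), precip)},
            coords={"x": ("x", x), "y": ("y", y)},
            attrs=dict(metadata),
        )

    return wrapper
```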

See below for an example using MeteoSwiss data.

>>> precip_ds = pysteps.io.import_mch_gif(filename, "AQC", 5.0)
>>> precip_ds.info()

xarray.Dataset {
dimensions:
        y = 640 ;
        x = 710 ;

variables:
        float64 precipitation(y, x) ;
                precipitation:standard_name = precipitation_rate ;
                precipitation:long_name = Precipitation product ;
                precipitation:product = AQC ;
                precipitation:unit = mm ;
                precipitation:accutime = 5.0 ;
                precipitation:transform = None ;
                precipitation:zerovalue = 0.0 ;
                precipitation:threshold = 0.0009628129986471908 ;
                precipitation:zr_a = 316.0 ;
                precipitation:zr_b = 1.5 ;
        float64 x(x) ;
                x:standard_name = projection_x_coordinate ;
                x:units = m ;
        float64 y(y) ;
                y:standard_name = projection_y_coordinate ;
                y:units = m ;

// global attributes:
        :institution = MeteoSwiss ;
        :projection = +proj=somerc  +lon_0=7.43958333333333 +lat_0=46.9524055555556 +k_0=1 +x_0=600000 +y_0=200000 +ellps=bessel +towgs84=674.374,15.056,405.346,0,0,0,0 +units=m +no_defs ;
}

The same for a BOM file:

xarray.Dataset {
dimensions:
        y = 512 ;
        x = 512 ;

variables:
        float64 precipitation(y, x) ;
                precipitation:standard_name = precipitation_rate ;
                precipitation:long_name = Precipitation product ;
                precipitation:product = None ;
                precipitation:unit = mm ;
                precipitation:accutime = 6 ;
                precipitation:transform = None ;
                precipitation:zerovalue = 0.0 ;
                precipitation:threshold = 0.05 ;
                precipitation:zr_a = None ;
                precipitation:zr_b = None ;
        float64 x(x) ;
                x:standard_name = projection_x_coordinate ;
                x:units = m ;
        float64 y(y) ;
                y:standard_name = projection_y_coordinate ;
                y:units = m ;

// global attributes:
        :institution = Commonwealth of Australia, Bureau of Meteorology ;
        :projection = +proj=aea  +lon_0=144.752 +lat_0=-37.852 +lat_1=-18.000 +lat_2=-36.000 ;
}

Edit: limit the scope of this PR to the io module only.

dnerini commented 3 years ago

Spontaneous questions:

cvelascof commented 3 years ago

Spontaneous responses:

> * Should we use "precipitation" to name the main variable? Can we make it more general, to make it clear that other variables can be used (e.g. cloud cover)?

At this point I believe that "precipitation" is a good choice, but I agree that we may have other cases, such as "rain_rate".

> * We should revise the metadata to better align it with the CF conventions and with other libraries used for radar data processing (e.g. wradlib).

Yes ... it is better to follow the CF conventions, as xarray uses them as well.

But I would prefer to keep key metadata in the structure as attributes of the variable ... metadata that can initially be derived easily from the data may be lost after a number of transformations, or may not be that easy to derive in the first place; for example, if the radar field is full of non-zero values, how can the zerovalue be derived?
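(For illustration, a toy example of how such metadata can live as variable attributes and survive transformations; the attribute names mirror the importer output above, and `keep_attrs` is standard xarray behaviour, not something specific to this PR.)

```python
import numpy as np
import xarray as xr

# Toy field where every pixel is wet: the zerovalue cannot be recovered from
# the data itself, so it has to travel with the array as an attribute.
precip = xr.DataArray(
    np.full((4, 4), 1.2),
    dims=("y", "x"),
    attrs={"unit": "mm", "accutime": 5.0, "zerovalue": 0.0},
)

# xarray drops attributes during arithmetic by default; keep_attrs=True
# preserves them through transformations.
with xr.set_options(keep_attrs=True):
    rainrate = precip * (60.0 / precip.attrs["accutime"])  # mm per 5 min -> mm/h

print(rainrate.attrs)  # {'unit': 'mm', 'accutime': 5.0, 'zerovalue': 0.0}
```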

dnerini commented 3 years ago

OK, for the moment I decided to return a DataArray instead of a Dataset (hence ignoring the quality field, which we never used anyway). This way, we remain agnostic about which variable we are reading (it could be anything).

 <xarray.DataArray (y: 640, x: 710)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * x        (x) float64 2.555e+05 2.565e+05 2.575e+05 ... 9.635e+05 9.645e+05
  * y        (y) float64 -1.595e+05 -1.585e+05 ... 4.785e+05 4.795e+05
Attributes:
    unit:         mm
    accutime:     5.0
    transform:    None
    zerovalue:    0.0
    threshold:    0.0009628129986471908
    zr_a:         316.0
    zr_b:         1.5
    institution:  MeteoSwiss
    projection:   +proj=somerc  +lon_0=7.43958333333333 +lat_0=46.95240555555...

I was thinking that it would make more sense if the DataArray included a singleton "t" dimension with the timestamp of the radar image, what do you think?
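(A quick illustration of the idea using xarray's `expand_dims`; the field and timestamp below are placeholders.)

```python
import numpy as np
import xarray as xr

precip = xr.DataArray(np.zeros((640, 710)), dims=("y", "x"))

# Attach the image timestamp as a singleton "time" dimension and coordinate,
# so that individual frames can later be concatenated along "time".
timestamp = np.datetime64("2017-01-31T06:00")
precip = precip.expand_dims(time=[timestamp])

print(precip.dims)  # ('time', 'y', 'x')
```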

cvelascof commented 3 years ago

My personal preference is to use a Dataset for the source input, as it stores multiple variables together, like the 'precipitation' and 'projection' information. The 'main/interest' variable can always be selected from the Dataset before passing it into pySTEPS routines.

Also, "time" should be a coordinate as that is what really differentiate one radar field from other one (assuming that all of them share common metadata for all time steps) ... also using 'time' as coordinate helps to open and concatenate multiple radar files using xarray.open_mfdataset

dnerini commented 3 years ago

I see your point concerning the dataset, @cvelascof, and this could be easily implemented for the importers. I just wonder about the need for datasets when we mostly work with single variables (radar precip). One can always build a new dataset when needed (when adding NWP data, for example).

And do we need an extra variable for the projection? I've seen it done before for NetCDF files, but we could try to simplify this by adopting EPSG codes, at which point it would be enough to add a crs=EPSG:xxxx attribute. What do you think?
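(A sketch of that simplification; EPSG:21781 should correspond to the Swiss CH1903 / LV03 grid implied by the proj string above, but that mapping is my assumption, not something stated in this PR.)

```python
import numpy as np
import xarray as xr

precip = xr.DataArray(np.zeros((640, 710)), dims=("y", "x"))

# Store the CRS as a plain attribute instead of a separate projection variable.
precip.attrs["crs"] = "EPSG:21781"

# The full proj string can be recovered when needed, e.g. with pyproj:
# from pyproj import CRS
# print(CRS.from_user_input(precip.attrs["crs"]).to_proj4())
```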

Also, most of our methods work on single arrays, and thus the migration to xarray would be easier with DataArrays. And we shouldn't make any assumptions about the names of the variables.

One possible solution could be to return a dataset when importing, but then the user would have to pass the variable of interest to a given method, for example:

ds = pysteps.io.read_timeseries() # or xr.open_mfdataset()
precip = ds.precip.pysteps.to_rainrate()
motion = pysteps.motion.dense_lucaskanade(precip)
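(One way such a `.pysteps` namespace could be provided is xarray's accessor mechanism; the accessor name, method and unit convention below are assumptions for illustration, not the API of this PR.)

```python
import xarray as xr

@xr.register_dataarray_accessor("pysteps")
class PystepsAccessor:
    """Hypothetical accessor backing the .pysteps namespace used above."""

    def __init__(self, da):
        self._da = da

    def to_rainrate(self):
        # Assumed convention: accumulations in mm over "accutime" minutes,
        # converted to a rain rate in mm/h.
        accutime = self._da.attrs.get("accutime", 5.0)
        with xr.set_options(keep_attrs=True):
            rainrate = self._da * (60.0 / accutime)
        rainrate.attrs["unit"] = "mm/h"
        return rainrate
```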
RubenImhoff commented 3 years ago

That could work, @dnerini (thanks for all the work, by the way)! We just shouldn't forget to update the examples accordingly (to avoid confusion). @cvelascof, does projection need to be a variable? I.e., can't it just be an attribute, as in @dnerini's example?

I find it hard to judge whether we need to move to DataArrays or Datasets. As @dnerini pointed out, we generally work with single arrays. What I can imagine, with our move towards blending and perhaps machine learning techniques later on, is that at some point we will want to involve more variables than just precipitation. If the information originates from one dataset, e.g. the NWP model, it may be somewhat redundant to import it into multiple DataArrays that contain similar grid, time and metadata information. However, either way will work, and I think we just have to decide on what we see as the clearer and cleaner method. I personally have no preference.

dnerini commented 3 years ago

This PR is by no means definitive, but should instead be seen as a first step towards integrating the xarray data model into pysteps. I suggest we merge it pretty soon and advance in other modules, knowing that we will anyway have to come back to the io module for more adjustments before v2 is ready.

aperezhortal commented 3 years ago

Great work @dnerini! I agree that we should merge it soon and advance in other modules.

I pushed two commits that add a temporary "legacy" option to allow the importers and readers to behave as in version 1. Also, the tests that are still written for v1 were updated to use this new legacy option. The tests that couldn't easily be fixed were manually skipped and tagged with a TODO message. This legacy option will allow us to maintain the testing functionality throughout the migration to xarray.
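(How the switch could look from the user side, mirroring the call at the top of this thread; whether the option is exposed exactly as a `legacy` keyword argument is my assumption.)

```python
import pysteps

filename = "path/to/AQC_composite.gif"  # placeholder

# New behaviour: the importer returns an xarray object.
precip_xr = pysteps.io.import_mch_gif(filename, "AQC", 5.0)

# Assumed v1-style behaviour behind the temporary switch: the familiar
# (precip, quality, metadata) tuple.
precip, quality, metadata = pysteps.io.import_mch_gif(filename, "AQC", 5.0, legacy=True)
```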

dnerini commented 3 years ago

Ah very nice @aperezhortal , thanks! Looks like tests are now failing because of missing dependencies. I'll cherry-pick commit b4c165ade1f31fb7567457a6bd95686702d8f70a to fix it as soon as possible (should have some time this afternoon).

codecov[bot] commented 3 years ago

Codecov Report

Merging #219 (15d5b5b) into pysteps-v2 (c8ebc2d) will decrease coverage by 8.33%. The diff coverage is 96.55%.


@@              Coverage Diff               @@
##           pysteps-v2     #219      +/-   ##
==============================================
- Coverage       79.84%   71.51%   -8.34%     
==============================================
  Files             137      139       +2     
  Lines            9929     9988      +59     
==============================================
- Hits             7928     7143     -785     
- Misses           2001     2845     +844     
| Flag | Coverage Δ |
|---|---|
| unit_tests | 71.51% <96.55%> (-8.34%) :arrow_down: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pysteps/io/readers.py | 81.25% <72.72%> (+7.05%) :arrow_up: |
| pysteps/tests/helpers.py | 87.50% <87.50%> (-6.62%) :arrow_down: |
| pysteps/decorators.py | 99.28% <100.00%> (+0.14%) :arrow_up: |
| pysteps/io/importers.py | 71.77% <100.00%> (+0.12%) :arrow_up: |
| pysteps/tests/test_cascade.py | 47.22% <100.00%> (-52.78%) :arrow_down: |
| pysteps/tests/test_datasets.py | 45.45% <100.00%> (-26.64%) :arrow_down: |
| pysteps/tests/test_detcatscores.py | 94.73% <100.00%> (-5.27%) :arrow_down: |
| pysteps/tests/test_exporters.py | 31.91% <100.00%> (-68.09%) :arrow_down: |
| pysteps/tests/test_io_archive.py | 100.00% <100.00%> (ø) |
| pysteps/tests/test_io_bom_rf3.py | 100.00% <100.00%> (ø) |
| ... and 35 more | |
