pySTEPS / pysteps

Python framework for short-term ensemble prediction systems.
https://pysteps.github.io/
BSD 3-Clause "New" or "Revised" License

Implement a data model #12

Open pulkkins opened 6 years ago

pulkkins commented 6 years ago

Pysteps operates on 2D grids that carry metadata (such as the grid definition, units, applied transformations and temporal resolution). Currently this information is passed around in a separate metadata dictionary or as separate arguments (see e.g. pysteps.nowcasts.steps.forecast) in a very inconsistent way. It would be a good idea to store all of this information in a single object.

Possible solution: use xarray (http://xarray.pydata.org/en/stable), which allows storing metadata in attributes, provides named dimensions and coordinates, and integrates with dask.
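As a rough illustration of what this could look like, here is a hypothetical sketch of a single precipitation field wrapped in an xarray.DataArray; the variable name, coordinate values and attribute keys are only illustrative, not an agreed pysteps convention:

```python
import numpy as np
import xarray as xr

# Illustrative 2D precipitation field on a 1 km grid.
precip = np.zeros((300, 400), dtype=np.float32)

field = xr.DataArray(
    precip,
    dims=("y", "x"),
    coords={
        "y": np.arange(300) * 1000.0,  # projected y coordinate [m]
        "x": np.arange(400) * 1000.0,  # projected x coordinate [m]
    },
    name="precip_intensity",
    attrs={
        "units": "mm/h",
        "transform": None,  # e.g. "dB" after a dB transformation
        "accutime": 5,      # temporal resolution [min]
    },
)

# Data and metadata now travel together in one object:
print(field.attrs["units"])  # mm/h
```

Any function receiving `field` then has the grid coordinates, units and temporal resolution attached, with no separate metadata dictionary to keep in sync.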

dnerini commented 6 years ago

This is definitely needed and an excellent idea!

The use of xarray could indeed be an ideal solution to this issue, at the cost of introducing an additional external dependency. xarray also offers a natural interface to netCDF files, pysteps' output format.

However, this is not an easy task and will require considerable work. @pulkkins, how do you plan to introduce such a data model?

aperezhortal commented 6 years ago

May I suggest another alternative? Instead of using an xarray object, a class can be used to handle a collection of radar composite images. Internally, this class can store the 2D radar data, the metadata, the observation-quality array and the grid information. The benefit over an xarray object is that more methods can be added to the class (for example import, export, or data preprocessing methods, etc.). If xarray needs to be used, the class can include a method to export part of the data to a DataArray object.

This is how the class could look like (with an execution example at the end):

import numpy

from pysteps.io import (import_fmi_pgm, import_bom_rf3,
                        import_mch_gif, import_odim_hdf5)

class RadarComposites(object):
    """
    Class to store a collection of 2D precipitation fields.
    """

    def __init__(self):
        self.times = None  # Not used for now, but could be useful to retrieve 2D fields at given times.
        self.data = None  # precip data
        self.quality = None
        self.latitude = None  # Not used for now
        self.longitude = None  # Not used for now
        self.metadata = None

    def add(self, file_paths, file_format='fmi', **kwargs):
        """ Add radar 2D composites to the collection."""

        file_format = file_format.lower()

        available_importers = dict(bom_rf3=import_bom_rf3,
                                   bom=import_bom_rf3,
                                   fmi_pgm=import_fmi_pgm,
                                   fmi=import_fmi_pgm,
                                   mch_gif=import_mch_gif,
                                   odim_hdf5=import_odim_hdf5)

        if file_format not in available_importers:
            raise ValueError("Format not supported.\n"
                             + "The supported formats are: "
                             + str(list(available_importers.keys())))

        importer = available_importers[file_format]
        data = list()
        quality = list()
        metadata = list()

        for file in file_paths:
            _data, _quality, _metadata = importer(file, **kwargs)

            metadata.append(_metadata)

            data.append(_data)
            quality.append(_quality)

        data = numpy.asarray(data)
        quality = numpy.asarray(quality)
        if self.data is None:
            self.data = data
        else:
            self.data = numpy.concatenate((self.data, data), axis=0)

        if self.quality is None:
            self.quality = quality
        else:
            self.quality = numpy.concatenate((self.quality, quality), axis=0)

        # For now, check that the metadata is the same for each time step
        equal_metadata = all(metadata[0] == value for value in metadata)

        if not equal_metadata:
            print("Warning: Metadata differs between some files!")

        self.metadata = metadata[0]

if __name__ == '__main__':
    # Usage example
    from os.path import join
    import glob

    data_dir = "../data/fmi/20170509"

    files = glob.glob(join(data_dir, '*.pgm.gz'))

    my_composites = RadarComposites()
    # Add composites to the collection
    my_composites.add(files, file_format='fmi', gzipped=True)

Another idea along the same lines is to subclass xarray.Dataset and add the functionality I mentioned above.

dnerini commented 6 years ago

So far we have tried to avoid using classes, simply because we wanted to keep the code as simple and linear as possible. But as you nicely suggest, classes offer a way to manage data and metadata together, potentially improving the workflow. The option @aperezhortal suggests is worth considering.

wcwoo commented 5 years ago

Hi everybody! I'm from the Hong Kong Observatory, currently responsible for the Community SWIRLS nowcast system, which is in operational use in meteorological services in China, India, Malaysia and South Africa. We're also working with colleagues in the Philippines, Vietnam, Myanmar and, soon, other Southeast Asian countries.

From our (somewhat painful) experience, I fully agree that it is necessary to adopt a single and consistent data model, and that xarray is a good choice. You might be interested to know that we've recently developed some algorithms to read in radar data files and return xarray.DataArray objects. Currently supported formats include ESRI ASCII, HDF5/ODIM, IRIS (raw and reflectivity), NetCDF (three variants used by the Philippines and Vietnam), and UF. The project website is at https://com-swirls.org/. Kindly let me know if you would like to know more.

dnerini commented 5 years ago

Hi @wcwoo, many thanks for your comment. We can certainly learn a lot from your experience with the SWIRLS system! I think xarray still looks like the best way forward for us, and you are confirming this impression.

I wasn't aware of the community version com-SWIRLS, I will certainly have a look at the project website that you've provided.

wcwoo commented 5 years ago

Sorry that I missed your reply; I forgot that I set up an automatic email-sorting rule some time ago. :sweat_smile:

I wrote this page to describe Com-SWIRLS' data model. The gist is to make development divisible and modules swappable. I hope one day we can run various QPF algorithms simultaneously to benchmark performance or to generate a super-ensemble.

kmuehlbauer commented 5 years ago

Is there a solution for this on the horizon? I would fully support the xarray approach. There is also no need to subclass Dataset: xarray accessors can be used instead.

This would greatly simplify interaction with other packages.

Just today I had to go the long way of importing/exporting data/results to/from pysteps to xarray to conveniently examine them using e.g. hvplot and feed them into my further processing chain.

If you consider xarray useful, I could try to set up a pull request or example code.
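For reference, the accessor approach mentioned above could look roughly like the following sketch. The `pysteps` namespace and the `to_rainrate` method are purely illustrative assumptions, not existing API:

```python
import numpy as np
import xarray as xr

# Hypothetical sketch: instead of subclassing xarray.Dataset, pysteps-specific
# functionality is attached under an accessor namespace.
@xr.register_dataset_accessor("pysteps")
class PystepsAccessor:
    def __init__(self, ds):
        self._ds = ds

    def to_rainrate(self):
        """Toy unit conversion: assumes precip is stored as mm accumulated
        over `accutime` minutes and converts it to mm/h."""
        accutime = self._ds.attrs.get("accutime", 5)
        out = self._ds * (60.0 / accutime)
        out.attrs = dict(self._ds.attrs, units="mm/h")
        return out

ds = xr.Dataset(
    {"precip": (("y", "x"), np.ones((4, 4)))},
    attrs={"accutime": 5, "units": "mm"},
)
rr = ds.pysteps.to_rainrate()
print(rr.attrs["units"])  # mm/h
```

The accessor keeps plain xarray objects flowing through the pipeline while still grouping the domain-specific methods under one namespace, which avoids the pitfalls of subclassing Dataset.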

kmuehlbauer commented 5 years ago

I've created a simple gist, which shows some capabilities of xarray and how it could be used for pysteps.

If there is real interest here, I can try to set up a simple workflow for some method (e.g. LK) that uses the xarray.pipe technique.

dnerini commented 5 years ago

Hi @kmuehlbauer ! Many thanks for your input and the very interesting example!

We have discussed adopting xarray in the past and I think the main point and concern was whether we wanted pysteps to remain a general library of nowcasting methods or instead to introduce a "strict" data structure such as xarray.

I think that there are pros and cons for both ways and that probably we are already a bit in between. Something that we currently really miss is a proper way to associate data and metadata in the workflow.

xarray offers extremely attractive solutions to this and other problems (e.g. interface to netCDF). Moreover, it appears to me that it is becoming somewhat of a standard in the weather radar community. This might further justify its adoption into pysteps.

This is certainly a topic we need to consider and discuss further, and I'm really glad that someone with your experience can help us make the right decisions!

cvelascof commented 5 years ago

@kmuehlbauer yes, I also support the use of xarray. xarray is very good at loading and saving netCDF and plotting variables, as you show in your gist. However, wrapping the core routines of pySTEPS in xarray may require more work. Or maybe not, if xarray.pipe can help. I would be interested to see an example of the xarray.pipe technique.

Also, as soon as I have a bit of spare time, I will try to add some working examples of how to connect some of the pysteps verification routines with xarray DataArrays. It really simplifies the code a lot, but on the other hand it may obscure the essence of the method a bit.

kmuehlbauer commented 5 years ago

Thanks for your immediate feedback @dnerini, @cvelascof.

I might be a bit biased, since I just did the implementation of xarray-based polar data handling and processing within wradlib (CfRadial1, CfRadial2 and ODIM_H5). (Side note: you can also read HDF5 using xarray; I'll add an example for reading ODIM HDF5 later next week.) And I found it extremely useful. The ability to use xarray Datasets all the way from polar radar data through gridding to feeding into pysteps and other packages is crucial for my (hence the bias) workflow. There is also much momentum around xarray with respect to projection information and how to handle it across multiple projects.

Neat integration with dask is also crucial if you think of reprocessing years of data.

I would not directly speak of a "strict" structure. From my experience it is easy and versatile. And you get the netcdf output for free.

@cvelascof For xarray.pipe, the called function needs to take a Dataset/DataArray as its first argument and return a Dataset/DataArray. At first glance this is compatible with most of pysteps' calling conventions and can be realized using simple checks. Still, this will need some thinking.
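To illustrate that requirement, here is a toy sketch of chaining two such functions with pipe; the functions are hypothetical stand-ins, not pysteps API:

```python
import numpy as np
import xarray as xr

# Each step takes a DataArray as its first argument and returns one,
# so steps can be chained with DataArray.pipe.
def db_transform(da, threshold=0.1):
    """Toy dB transformation: clip values below `threshold`, then 10*log10."""
    out = 10.0 * np.log10(da.clip(min=threshold))
    out.attrs = dict(da.attrs, transform="dB")
    return out

def clip_low(da, vmin=-5.0):
    """Clip (dB) values below `vmin`."""
    return da.clip(min=vmin)

rain = xr.DataArray(np.array([[0.0, 1.0], [10.0, 100.0]]),
                    dims=("y", "x"), attrs={"units": "mm/h"})

# Chained processing; extra keyword arguments pass through pipe.
result = rain.pipe(db_transform, threshold=0.1).pipe(clip_low, vmin=-5.0)
print(result.values)
```

A pysteps routine that does not follow this calling convention could still be used via a thin adapter passed to pipe (e.g. a lambda that unpacks the array and metadata from the DataArray).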

Update: And, sometimes I forget this, you all have done a great job providing this useful package.

kmuehlbauer commented 5 years ago

Added a simple LK workflow for GIF images from MeteoSwiss. It shows the xarray DatasetAccessor feature and some xarray.pipe usage, as well as the xarray.rolling capability.

dnerini commented 5 years ago

Trying to advance this issue, I've drafted a PR which aims at including some (partial) support for xarray within pysteps, as well as a tutorial (based on @kmuehlbauer's excellent gist) to show how xarray can be integrated into pysteps.

Any help and feedback is very welcome!

PR: https://github.com/pySTEPS/pysteps/pull/123

wcwoo commented 5 years ago

I've been pondering how to make our modules more interoperable. If we manage to standardize the data interface, there would be no need for any wrapper. The user script could be slimmer and cleaner.

For example, wradlib's code would take in a radar file and output an xarray object (DataArray or Dataset), which could then be passed directly to pySTEPS' or SwirlsPy's modules for nowcasting. The outputs of pySTEPS and SwirlsPy would also be xarray objects and so could be used directly for visualization, hydrological models, nowcast warning applications, etc.

To achieve this, not only do we need xarray, but we also need to think if and how to standardize the data. There are questions like:

  1. What is stored in the xarray? e.g. reflectivity, rainfall rate, rainfall depth (over how long?)
  2. What is the unit of the above? Shall we adopt the CF convention?
  3. What does the timestamp represent? i.e. the start time or the end time of the observation window, or depending on the data? Currently, most satellite and radar are marked with the start time, but rain gauge data are marked with the end time, at least in my place.
  4. What is the dimension of the xarray? For nowcasting it probably starts at 2D+time, but before long we want to do 3D+time, either for better use of radar data or for blending with NWP. Then for probabilistic nowcast we would need to add one more dimension for "scenarios". So, is 5D the maximum?

kmuehlbauer commented 5 years ago

To achieve this, not only do we need xarray, but we also need to think if and how to standardize the data. There are questions like:

1. What is stored in the xarray?  e.g. reflectivity, rainfall rate, rainfall depth (over how long?)

This is something the user should decide. It's just a question of how much magic you want to put in the implementation. Keep it simple might be a good approach.

2. What is the unit of the above?  Shall we adopt the CF convention?

I would be in favour of adopting CF conventions. This is standard for most atmospheric science data.

3. What does the timestamp represent? i.e. the start time or the end time of the observation window, or depending on the data?  Currently, most satellite and radar are marked with the start time, but rain gauge data are marked with the end time, at least in my place.

This is an implementation detail: for start_time-aligned measurements the data is valid from the start_time to the next start_time; for end_time-aligned measurements it is valid from the previous end_time to the current end_time. It depends on the measurement frequency, so some alignment is needed.

4. What is the dimension of the xarray?  For nowcasting it probably starts at 2D+time, but before long we want to do 3D+time, either for better use of radar data or for blending with NWP.  Then for probabilistic nowcast we would need to add one more dimension for "scenarios".  So, is 5D the maximum?

As long as we invent a clear dimension naming scheme, this should also be no big issue.

Just my 2C
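As a concrete illustration of questions 2 and 4 above, here is a hypothetical sketch of what a 5D nowcast array with CF-style attributes could look like. All dimension names, coordinate values and attributes are only a proposal, not an agreed convention:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Proposed dimension order: ensemble member, time, height, y, x.
ens_members, times, levels, ny, nx = 3, 4, 2, 10, 12

nowcast = xr.DataArray(
    np.zeros((ens_members, times, levels, ny, nx), dtype=np.float32),
    dims=("ens_member", "time", "z", "y", "x"),
    coords={
        "ens_member": np.arange(ens_members),
        "time": pd.date_range("2019-06-01T12:00", periods=times, freq="5min"),
        "z": [500.0, 1500.0],  # height levels [m]
    },
    name="rainfall_rate",
    attrs={
        "standard_name": "rainfall_rate",  # CF standard name
        "units": "mm h-1",
    },
)

# A deterministic 2D+time product is just a slice of the same structure:
single = nowcast.isel(ens_member=0, z=0)
print(single.dims)  # ('time', 'y', 'x')
```

Lower-dimensional products then fall out naturally as slices, so the "is 5D the maximum?" question mostly reduces to agreeing on the dimension names and their order.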

wcwoo commented 5 years ago

This is something the user should decide. It's just a question of how much magic you want to put in the implementation. Keep it simple might be a good approach.

Agreed that we should keep it simple, at least for the start.

I would be in favour of adopting CF conventions. This is standard for most atmospheric science data.

Agreed.

This is an implementation detail: for start_time-aligned measurements the data is valid from the start_time to the next start_time; for end_time-aligned measurements it is valid from the previous end_time to the current end_time. It depends on the measurement frequency, so some alignment is needed.

Indeed, it would be clearer to use start_time and end_time instead of just time. Some modules would have to deal with the change from start_time to end_time, e.g. when deriving rainfall amounts from radar. Given that it is an established convention to mark satellite and radar data with the start_time and rain gauge and lightning data with the end_time, it is perhaps better to keep things as they are and be careful while performing those steps.

As long as we invent a clear dimension naming scheme, this should also be no big issue.

Sure.

Recently we've received a support request from Cemaden (Brazil) to use Rainbow data with SwirlsPy. We could develop a swirlspy.rad.rainbow module, but we may also take this opportunity to try implementing a common data model. @kmuehlbauer: would it be possible for you to develop a method in wradlib that takes in Rainbow data files and outputs an xarray.Dataset object? If so, we can develop a method in swirlspy to perform nowcasting with that object.

I'm tagging @SelokYeung and @cwjchin as they may help implement the changes in swirlspy.