stac-extensions / forecast

Common fields for (meteorological/weather) forecast data.
Apache License 2.0

Expressing uncertainty? #3

Open cboettig opened 1 year ago

cboettig commented 1 year ago

Thoughts about how to express uncertainty? To again use NOAA GEFS as an example, uncertainty is expressed through an ensemble identifier, which in effect acts as an additional dimension, similar to the way forecasting introduces a second time dimension. Other forecasts may express uncertainty as a mean and variance, through other parameterizations, or not at all. In each case, these would be helpful things to capture in the metadata. In addition to the ensemble id, NOAA also makes available an "ave" series with the average values across the model ensemble.

Along the lines of #2, some forecasters may also serialize this information into individual assets, e.g. as a 'band' or an additional dimension in n-d array formats like netCDF. But all the same, it would be nice to have it available in metadata.

m-mohr commented 1 year ago

Sounds reasonable, but do you have a specific proposal in mind? For separate assets we can certainly specify a new role to use, but I'm not sure yet how to specify the other variant(s).

cboettig commented 1 year ago

Thanks. Within https://ecoforecast.org, we use the terms "family" (e.g. normal, lognormal, or 'sample' in the case of an ensemble) and "parameter" to specify which parameter of the distribution (or ensemble number) the particular asset refers to. So one proposal could be introducing optional terms forecast:family and forecast:parameter, though without a longer description it is not immediately obvious what those refer to.

A narrower proposal would focus only on ensemble forecasts, e.g. introduce the term forecast:ensemble_id or similar, indicating which ensemble member the asset refers to.

Just putting these out there as tentative suggestions, more input is probably needed before settling on the best mechanism.
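To make the two tentative suggestions above concrete, here is a minimal sketch of how STAC assets might carry them. The field names forecast:family, forecast:parameter and forecast:ensemble_id are only the terms floated in this thread and are not part of the extension; the hrefs, asset keys, and the helper make_asset are hypothetical.

```python
import json


def make_asset(href, **forecast_fields):
    """Build a STAC-style asset dict with 'forecast:'-prefixed fields.

    The forecast fields used below are proposals from this discussion,
    not fields defined by the forecast extension.
    """
    asset = {"href": href, "roles": ["data"]}
    for name, value in forecast_fields.items():
        asset[f"forecast:{name}"] = value
    return asset


assets = {
    # Broad proposal, parametric case: one asset per distribution parameter.
    "temp-mu": make_asset(
        "https://example.com/temp_mu.nc", family="normal", parameter="mu"
    ),
    "temp-sigma": make_asset(
        "https://example.com/temp_sigma.nc", family="normal", parameter="sigma"
    ),
    # Broad proposal, ensemble case: family 'sample', parameter = member number.
    "temp-ens-03": make_asset(
        "https://example.com/temp_ens03.nc", family="sample", parameter="3"
    ),
    # Narrow proposal: only an ensemble member identifier.
    "temp-gep03": make_asset("https://example.com/gep03.nc", ensemble_id="gep03"),
}

print(json.dumps(assets, indent=2))
```

Either variant keeps the uncertainty information at the asset level, so a client could filter assets by family or member without opening the files.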

chris-little commented 1 year ago

@cboettig @m-mohr It would be helpful to know what use cases are envisaged for identifying, for example, individual forecast ensemble members. In my experience of operational practice, individual ensemble members are usually only identified for further processing to produce more useful assets for an end user. Of course, an expert, or even an AI/ML system, may then identify a specific member as 'best' for a certain user.

An end user would probably find a mean, median, quartile, or even percentile figures for the total ensemble more useful.

HTH

cboettig commented 1 year ago

Thanks @chris-little. I agree that end users will often be interested in the summary statistics you mention, and supporting those would be great! As you've also noted, I think there's a growing audience of users & tools that will want to consume ensemble-based forecasts (e.g. this is supported in various user-friendly ML toolkits, like darts).

At the same time, I envision that a STAC metadata standard would be able to describe forecast uncertainty as it is currently represented in the existing assets already produced (not being from the meteorology world, the GEFS product is my go-to reference case), which of course already represent it in ensemble terms. So it would be nice to have a metadata format that is rich enough to support both.

One key application of having ensemble-based assets properly described in metadata is in supporting downstream tools for probabilistic scoring. Given a STAC catalogue of different forecast methods that might try to predict the same thing, what could be more natural than wanting to compare skill scores? I think the meteorology community is ahead of most others there, but it would be great to see more ML applications go beyond RMSE or quartiles to support strictly proper scores. (I think @aaronspring has some nice examples in climpred.) As you know, having ensembles (or parametric distributions) is necessary here.
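As an illustration of why the individual members matter for scoring, here is a minimal pure-Python sketch of one widely used proper score, the continuous ranked probability score (CRPS), estimated directly from ensemble members. This is not the climpred API; the function name crps_ensemble and the sample values are made up for the example.

```python
def crps_ensemble(members, observation):
    """Estimate the CRPS of an ensemble forecast against one observation.

    Uses the standard sample estimator
        mean|x_i - y| - 0.5 * mean|x_i - x_j|
    where x_i are the ensemble members and y is the observation.
    Lower is better; with a single member it reduces to absolute error.
    """
    n = len(members)
    if n == 0:
        raise ValueError("ensemble must have at least one member")
    # Average distance between the members and the observation.
    accuracy = sum(abs(x - observation) for x in members) / n
    # Average distance between all pairs of members (the ensemble spread).
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (n * n)
    return accuracy - 0.5 * spread


observed = 2.0
sharp = crps_ensemble([1.9, 2.0, 2.1], observed)  # tight ensemble around the truth
wide = crps_ensemble([0.0, 2.0, 4.0], observed)   # same mean, much larger spread
print(sharp < wide)  # the tighter, well-centred ensemble scores better
```

Both ensembles have the same mean, so a score based on the mean alone (such as RMSE) could not tell them apart; the member-level values are what make the comparison possible.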

For our use cases in ecology, forecast uncertainty is sometimes better described in terms of parametric distributions (exponential, Poisson, etc). Obviously asymmetric and heavy-tailed distributions are not well represented by the more generic summary statistics.

m-mohr commented 1 year ago

What I'm doing right now in my implementation (and this is probably not ideal, as discussed in #8): gdalinfo gives me the bands of GRIB2 files, and for each band I can add statistics via the raster extension: https://github.com/stac-extensions/raster#statistics-object

Edit: I just learned that this is not ideal. We should use the data cube extension and add a way there to add statistics similarly.

rqthomas commented 9 months ago

It would be great to revisit @cboettig's proposal for adding support for expressing uncertainty. There is a definite use case, because many ensemble-based forecasts could potentially be described using this STAC extension. We are doing a lot of work in the ecological forecasting space with ensemble forecasts. We are using two options: 1) family and parameter columns or 2) ensemble_id. The first is for parametric uncertainty, where the uncertainty is represented using the parameters of a distribution (mu and sigma for normal, rate and shape for gamma, etc.). The second is for ensemble forecasts (like the NOAA Global Ensemble Forecast System), but an ensemble forecast can also be represented using the first (family = sample, parameter = sample id number).
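The equivalence between the two options can be sketched as a small conversion over long-format rows. This is only an illustration: the column names datetime, variable and prediction, the sample values, and the helper ensemble_to_family_parameter are assumptions for the example, not a fixed schema.

```python
def ensemble_to_family_parameter(rows):
    """Re-express ensemble_id rows using family/parameter columns.

    Each ensemble member becomes a row with family == "sample" and
    parameter == the member's id (as a string), as described above.
    """
    converted = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "ensemble_id"}
        new_row["family"] = "sample"
        new_row["parameter"] = str(row["ensemble_id"])
        converted.append(new_row)
    return converted


# Option 2: one row per ensemble member.
ensemble_rows = [
    {"datetime": "2024-01-01", "variable": "temperature", "ensemble_id": 1, "prediction": 4.2},
    {"datetime": "2024-01-01", "variable": "temperature", "ensemble_id": 2, "prediction": 3.9},
]

# Option 1: one row per distribution parameter (here a normal distribution).
parametric_rows = [
    {"datetime": "2024-01-01", "variable": "temperature", "family": "normal", "parameter": "mu", "prediction": 4.0},
    {"datetime": "2024-01-01", "variable": "temperature", "family": "normal", "parameter": "sigma", "prediction": 0.3},
]

# After conversion, both kinds of forecast share the same columns.
combined = parametric_rows + ensemble_to_family_parameter(ensemble_rows)
for row in combined:
    print(row["family"], row["parameter"], row["prediction"])
```

With everything in family/parameter form, a single pair of metadata fields would be enough to describe both parametric and ensemble forecasts.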

m-mohr commented 9 months ago

Please go ahead and create a proposal as PR :-)