mf_error from repeatability and variability

joe-pitt commented 1 month ago

Based on teams discussion with @hdelongueville and @brendan-m-murphy on 2024-10-04:

When resampling mf_variability we want to combine the variability within each obs averaging period (e.g. 1 min or 1 hr) with the variability of the averages within each inversion period (e.g. 4 hr). Here is Alistair's suggestion: VarT = (1/N) sum[nj(Varj + SqrMeanj)] - SqrMeanT
This calculation needs to handle nans in the reported mf_variability (i.e. in that case it just takes the stdev of the averages within each 4 hr period)
We also need to handle situations where mf_repeatability exists in the dataset, but contains nans. One possibility would be: a) if mf_repeatability is entirely nan, drop it from the dataset b) if mf_repeatability contains some real values, set the nans to the mean or median of these real values prior to resampling

joe-pitt commented 1 month ago

From discussion 2024-10-10:

Try implementing Alistair's suggestion for resampling variability (openghg issue)
Also need to handle Medusa case where there are only 1 or 2 points in a 4H window. Need to be able to specify the use of a wider window for variability calc (openghg and inversions issue)
Build in flexible options (specified in ini file) for what to do in case of nans in mf_repeatability - default is to throw an error, but ideally we could specify different functions to use, or chose to drop nans (inversions issue)
As part of this, we need to be able to specify which columns we want drop nans to be applied to (openghg issue)
For repeatability - need to think more about how this gets resampled. At the moment we assume the errors on successive measurements are independent and random, but that is not true in reality (see Issue #988). It's not immediately obvious how to do this better, but the current calculation leads to unrealistically small uncertainties in some cases. (openghg and inversions issue)
Try adding option to retrieve LTR etc from icos .nc files (see https://github.com/openghg/openghg/issues/1106). (openghg issue)
keep original names for uncertainty variables in the object store, and add option in openghg_inversions to specify which variable to use as the uncertainty (openghg and inversions issue)

hdelongueville commented 1 month ago

For nans in mf_repeatability, how should we proceed when mf_repeatability is only nans during the inversion period? As it is the case, for example, of CBW data during 2018-01 and 2018-02 (see figure below).

If we implement the option to fill nan values with the mean (or median or twice the mean), we could

first, calculate the mean on the inversion period (e.g. one month)
if there isn't any values for that period, calculate the mean on all the available data?

Any thoughts?

joe-pitt commented 1 month ago

I think it would be good to be able to choose between different options (specified in the ini file). But I think the ones suggested above would be a good start. Maybe you could specify in the ini whether you wanted the mean or median, and whether you wanted to upscale by some factor (e.g. 2)?

hdelongueville commented 1 month ago

I agree, I guess my question is on what time period do we want to calculate those metrics?

joe-pitt commented 1 month ago

I like your idea of doing it within the inversion period first, then looking at the whole dataset. That way it deals with both the odd missing value and cases like CBW above. For this CBW data this is really no good way of calculating the missing repeatability, so we might as well do something simple.

hdelongueville commented 1 month ago

What do you suggest instead for CBW?

joe-pitt commented 1 month ago

Sorry I meant to say "there is really no good way"... So I am endorsing your suggestion!

hdelongueville commented 1 month ago

In the add_obs_error function (line 35 of get_data.py), when both mf_repeatability and mf_variability are present, but add_averaging_error is set to False, the code assigns mf_error to mf_repeatability. This approach does not consider mf_variability. Is this really what we want?

joe-pitt commented 1 month ago

Hmmm I guess "add_averaging_error" might not be the best name for this as it currently stands. I guess one could imagine some scenario where we didn't think mf_variability was a good proxy for model error and so we only wanted to consider the repeatability. But in our cases I think we always want both

hdelongueville commented 1 month ago

Yes I agree with you. I'll investigate and ask the team if this has been set up this way for a particular reason.

hdelongueville commented 1 month ago

From discussion 2024-10-23, Brendan's reply:

There probably isn't a case where we'd want to ignore variability. That option has been in inversions for a long time, so I kept it when I updated this part of the code.

Maybe at some point we will want an options to make mf_error = np.sqrt(sum(err**2 for err in list_of_err_arrays)) so we can use other types of uncertainties. Then maybe you could specify what goes into list_of_err_arrays

By default list_of_err_arrays could either be [ds.mf_variability] or [ds.mf_variability, ds.mf_repeatability] if repeatability is present.

In the ini, you could probably have something like uncertainties = ["mf_variability", "mf_repeatabilty"] then use mf_error = np.sqrt(sum(ds[err]**2 for err in uncertainties))

Anyway, for now it's probably fine to remove add_averaging_error and maybe just add logging messages to say what uncertianties are used

(or maybe... just leave add_averaging_error in, but print a warning saying it is ignored if it is set to False . That might make it easier to use old ini files)

hdelongueville commented 1 month ago

I think Brendan's solution is a really good idea, as it allows more flexibility and traceability. So, I'll add a warning for now, knowing that everything will be updated "soon" once OpenGHG supports other types of uncertainty.

hdelongueville commented 1 month ago

From the discussion above,

If we implement the option to fill nan values with the mean (or median or twice the mean), we could

first, calculate the mean on the inversion period (e.g. one month)

if there isn't any values for that period, calculate the mean on all the available data?

@brendan-m-murphy Do you know if there is an easy way to get the whole time period available for a specific site in fp_all (which is a dictionary of ModelScenario objects, keyed by site names)? Can I just simply do for example fp_all['cbw'].metadata['start_date'] OR fp_all['cbw'] has no metadata and I need to call openghg.retrieve.search_surface?

brendan-m-murphy commented 4 weeks ago

Hi @hdelongueville, sorry I missed this.

Is this assuming that fp_all has already been created? The thing stored at fp_all["CBW"] is the xr.Dataset returned by the method footprints_data_merge in ModelScenario. That function will raise an error if there are no overlapping time points between the obs and the footprint, so I'm not sure if option 2 can happen if NaNs are dropped before passing the ObsData object to ModelScenario. (The error is actually raised around lines 729-730 in _scenario.py in the function _align_obs_footprint.)

But, if there is a case where this happens, then you would need to go back and search for the obs data again.

Also, fp_all["CBW"] doesn't have a .metadata attribute because it is just a xr.Dataset.

openghg / openghg_inversions

mf_error from repeatability and variability #212