openghg / openghg_inversions

University of Bristol Atmospheric Chemistry Research Group RHIME Inversion code (with openghg dependency)
MIT License
5 stars 0 forks source link

mf_error from repeatability and variability #212

Open joe-pitt opened 1 month ago

joe-pitt commented 1 month ago

Based on teams discussion with @hdelongueville and @brendan-m-murphy on 2024-10-04:

joe-pitt commented 1 month ago

From discussion 2024-10-10:

hdelongueville commented 1 month ago

For nans in mf_repeatability, how should we proceed when mf_repeatability is only nans during the inversion period? As it is the case, for example, of CBW data during 2018-01 and 2018-02 (see figure below).

If we implement the option to fill nan values with the mean (or median or twice the mean), we could

  1. first, calculate the mean on the inversion period (e.g. one month)
  2. if there isn't any values for that period, calculate the mean on all the available data?

Any thoughts?

image

joe-pitt commented 1 month ago

I think it would be good to be able to choose between different options (specified in the ini file). But I think the ones suggested above would be a good start. Maybe you could specify in the ini whether you wanted the mean or median, and whether you wanted to upscale by some factor (e.g. 2)?

hdelongueville commented 1 month ago

I agree, I guess my question is on what time period do we want to calculate those metrics?

joe-pitt commented 1 month ago

I like your idea of doing it within the inversion period first, then looking at the whole dataset. That way it deals with both the odd missing value and cases like CBW above. For this CBW data this is really no good way of calculating the missing repeatability, so we might as well do something simple.

hdelongueville commented 1 month ago

What do you suggest instead for CBW?

joe-pitt commented 1 month ago

Sorry I meant to say "there is really no good way"... So I am endorsing your suggestion!

hdelongueville commented 1 month ago

In the add_obs_error function (line 35 of get_data.py), when both mf_repeatability and mf_variability are present, but add_averaging_error is set to False, the code assigns mf_error to mf_repeatability. This approach does not consider mf_variability. Is this really what we want?

joe-pitt commented 1 month ago

Hmmm I guess "add_averaging_error" might not be the best name for this as it currently stands. I guess one could imagine some scenario where we didn't think mf_variability was a good proxy for model error and so we only wanted to consider the repeatability. But in our cases I think we always want both

hdelongueville commented 1 month ago

Yes I agree with you. I'll investigate and ask the team if this has been set up this way for a particular reason.

hdelongueville commented 1 month ago

From discussion 2024-10-23, Brendan's reply:

There probably isn't a case where we'd want to ignore variability. That option has been in inversions for a long time, so I kept it when I updated this part of the code.

Maybe at some point we will want an options to make mf_error = np.sqrt(sum(err**2 for err in list_of_err_arrays)) so we can use other types of uncertainties. Then maybe you could specify what goes into list_of_err_arrays

By default list_of_err_arrays could either be [ds.mf_variability] or [ds.mf_variability, ds.mf_repeatability] if repeatability is present.

In the ini, you could probably have something like uncertainties = ["mf_variability", "mf_repeatabilty"] then use mf_error = np.sqrt(sum(ds[err]**2 for err in uncertainties))

Anyway, for now it's probably fine to remove add_averaging_error and maybe just add logging messages to say what uncertianties are used

(or maybe... just leave add_averaging_error in, but print a warning saying it is ignored if it is set to False . That might make it easier to use old ini files)

hdelongueville commented 1 month ago

I think Brendan's solution is a really good idea, as it allows more flexibility and traceability. So, I'll add a warning for now, knowing that everything will be updated "soon" once OpenGHG supports other types of uncertainty.

hdelongueville commented 1 month ago

From the discussion above,

If we implement the option to fill nan values with the mean (or median or twice the mean), we could

  1. first, calculate the mean on the inversion period (e.g. one month)
  2. if there isn't any values for that period, calculate the mean on all the available data?

@brendan-m-murphy Do you know if there is an easy way to get the whole time period available for a specific site in fp_all (which is a dictionary of ModelScenario objects, keyed by site names)? Can I just simply do for example fp_all['cbw'].metadata['start_date'] OR fp_all['cbw'] has no metadata and I need to call openghg.retrieve.search_surface?

brendan-m-murphy commented 4 weeks ago

Hi @hdelongueville, sorry I missed this.

Is this assuming that fp_all has already been created? The thing stored at fp_all["CBW"] is the xr.Dataset returned by the method footprints_data_merge in ModelScenario. That function will raise an error if there are no overlapping time points between the obs and the footprint, so I'm not sure if option 2 can happen if NaNs are dropped before passing the ObsData object to ModelScenario. (The error is actually raised around lines 729-730 in _scenario.py in the function _align_obs_footprint.)

But, if there is a case where this happens, then you would need to go back and search for the obs data again.

Also, fp_all["CBW"] doesn't have a .metadata attribute because it is just a xr.Dataset.