From discussion 2024-10-10:
For NaNs in mf_repeatability: how should we proceed when mf_repeatability is all NaNs during the inversion period? This is the case, for example, for the CBW data during 2018-01 and 2018-02 (see figure below).
If we implement the option to fill NaN values with the mean (or median or twice the mean), we could
- first, calculate the mean over the inversion period (e.g. one month)
- if there aren't any values for that period, calculate the mean over all the available data?
Any thoughts?
I think it would be good to be able to choose between different options (specified in the ini file). But I think the ones suggested above would be a good start. Maybe you could specify in the ini whether you wanted the mean or median, and whether you wanted to upscale by some factor (e.g. 2)?
I agree; I guess my question is: over what time period do we want to calculate those metrics?
I like your idea of doing it within the inversion period first, then looking at the whole dataset. That way it deals with both the odd missing value and cases like CBW above. For this CBW data this is really no good way of calculating the missing repeatability, so we might as well do something simple.
What do you suggest instead for CBW?
Sorry I meant to say "there is really no good way"... So I am endorsing your suggestion!
In the add_obs_error function (line 35 of get_data.py), when both mf_repeatability and mf_variability are present but add_averaging_error is set to False, the code sets mf_error to mf_repeatability alone. This approach does not take mf_variability into account. Is this really what we want?
Hmmm I guess "add_averaging_error" might not be the best name for this as it currently stands. I guess one could imagine some scenario where we didn't think mf_variability was a good proxy for model error and so we only wanted to consider the repeatability. But in our cases I think we always want both.
Yes I agree with you. I'll investigate and ask the team if this has been set up this way for a particular reason.
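For reference, a minimal sketch of the logic being discussed (illustrative only, not the actual add_obs_error in get_data.py; in particular, how the "both uncertainties" branch combines the two terms is an assumption here):

```python
import numpy as np
import xarray as xr

def add_obs_error(ds: xr.Dataset, add_averaging_error: bool = True) -> xr.Dataset:
    """Illustrative sketch of the behaviour described above."""
    if "mf_repeatability" in ds and add_averaging_error:
        # both uncertainties used; assumed here to be combined in quadrature
        ds["mf_error"] = np.sqrt(ds.mf_repeatability**2 + ds.mf_variability**2)
    elif "mf_repeatability" in ds:
        # add_averaging_error=False: mf_error is just the repeatability,
        # and mf_variability is ignored (the behaviour questioned above)
        ds["mf_error"] = ds.mf_repeatability
    else:
        # no repeatability available: fall back to the variability alone
        ds["mf_error"] = ds.mf_variability
    return ds
```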
From discussion 2024-10-23, Brendan's reply:
There probably isn't a case where we'd want to ignore variability. That option has been in inversions for a long time, so I kept it when I updated this part of the code.
Maybe at some point we will want an option to make mf_error = np.sqrt(sum(err**2 for err in list_of_err_arrays)) so we can use other types of uncertainties. Then maybe you could specify what goes into list_of_err_arrays.
By default list_of_err_arrays could either be [ds.mf_variability] or [ds.mf_variability, ds.mf_repeatability] if repeatability is present.
In the ini, you could probably have something like uncertainties = ["mf_variability", "mf_repeatability"], then use mf_error = np.sqrt(sum(ds[err]**2 for err in uncertainties)).
Anyway, for now it's probably fine to remove add_averaging_error and maybe just add logging messages to say what uncertainties are used.
(Or maybe... just leave add_averaging_error in, but print a warning saying it is ignored if it is set to False. That might make it easier to use old ini files.)
I think Brendan's solution is a really good idea, as it allows more flexibility and traceability. So, I'll add a warning for now, knowing that everything will be updated "soon" once OpenGHG supports other types of uncertainty.
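A minimal sketch of the quadrature combination suggested above (assuming ds is the merged obs xr.Dataset and that every name listed in uncertainties exists as a data variable; reading the list from the ini is left out):

```python
import numpy as np
import xarray as xr

def combined_obs_error(ds: xr.Dataset, uncertainties: list[str]) -> xr.DataArray:
    """Combine the listed uncertainty variables in quadrature."""
    missing = [name for name in uncertainties if name not in ds]
    if missing:
        raise KeyError(f"uncertainty variables not found in the dataset: {missing}")
    return np.sqrt(sum(ds[name] ** 2 for name in uncertainties))

# e.g. with uncertainties = ["mf_variability", "mf_repeatability"] from the ini:
# ds["mf_error"] = combined_obs_error(ds, ["mf_variability", "mf_repeatability"])
```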
From the discussion above:
If we implement the option to fill NaN values with the mean (or median or twice the mean), we could
- first, calculate the mean over the inversion period (e.g. one month)
- if there aren't any values for that period, calculate the mean over all the available data?
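A minimal sketch of that fallback logic (hypothetical names; it assumes the obs are an xr.Dataset with an mf_repeatability variable and a time coordinate, and that the statistic and scale factor come from the ini as discussed above):

```python
import xarray as xr

def fill_missing_repeatability(
    obs: xr.Dataset,
    start: str,            # inversion period start, e.g. "2018-01-01"
    end: str,              # inversion period end, e.g. "2018-02-01"
    method: str = "mean",  # or "median", read from the ini
    scale: float = 1.0,    # e.g. 2.0 to inflate the fill value
) -> xr.Dataset:
    """Fill NaN mf_repeatability with a statistic of the inversion period,
    falling back to the whole record if the period has no valid values."""
    rep = obs.mf_repeatability
    period = rep.sel(time=slice(start, end))

    # statistic over the inversion period; fall back to all available data
    fill = getattr(period, method)(skipna=True)
    if fill.isnull():
        fill = getattr(rep, method)(skipna=True)

    obs["mf_repeatability"] = rep.fillna(scale * float(fill))
    return obs
```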
@brendan-m-murphy Do you know if there is an easy way to get the whole time period available for a specific site in fp_all (which is a dictionary of ModelScenario objects, keyed by site names)? Can I simply do, for example, fp_all['cbw'].metadata['start_date'], or does fp_all['cbw'] have no metadata, meaning I need to call openghg.retrieve.search_surface?
Hi @hdelongueville, sorry I missed this.
Is this assuming that fp_all has already been created? The thing stored at fp_all["CBW"] is the xr.Dataset returned by the method footprints_data_merge in ModelScenario. That function will raise an error if there are no overlapping time points between the obs and the footprint, so I'm not sure if option 2 can happen if NaNs are dropped before passing the ObsData object to ModelScenario. (The error is actually raised around lines 729-730 in _scenario.py, in the function _align_obs_footprint.)
But, if there is a case where this happens, then you would need to go back and search for the obs data again.
Also, fp_all["CBW"] doesn't have a .metadata attribute because it is just an xr.Dataset.
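So, assuming the merged dataset keeps a time coordinate, a quick way to see the period actually covered for a site would be something like this (illustrative fragment; fp_all as in the discussion above):

```python
import pandas as pd

# assumes fp_all has already been built, so fp_all["CBW"] is the merged xr.Dataset
ds = fp_all["CBW"]

# time span actually covered by the merged obs/footprint data
start = pd.Timestamp(ds.time.values[0])
end = pd.Timestamp(ds.time.values[-1])
print(f"CBW merged data covers {start} to {end}")

# for the full obs record (beyond the merged period) you would need to
# search for the obs data again, as noted above
```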
Based on a Teams discussion with @hdelongueville and @brendan-m-murphy on 2024-10-04, the variance of the combined data can be computed from the counts, means and variances of the individual groups as

$$\mathrm{Var}_T = \frac{1}{N}\sum_j n_j\left(\mathrm{Var}_j + \bar{x}_j^{\,2}\right) - \bar{x}_T^{\,2}$$

where $n_j$, $\mathrm{Var}_j$ and $\bar{x}_j$ are the number of points, variance and mean of group $j$, $N = \sum_j n_j$ is the total number of points, and $\bar{x}_T$ is the mean of all the data.
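A small numpy sketch of this formula (hypothetical helper; it expects per-group counts, means and population variances, i.e. ddof=0), checked against a direct calculation:

```python
import numpy as np

def combine_variances(counts, means, variances):
    """Total (population) variance of pooled groups, following
    Var_T = (1/N) * sum_j n_j * (Var_j + mean_j**2) - mean_T**2."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    n_total = counts.sum()
    mean_total = (counts * means).sum() / n_total
    return (counts * (variances + means**2)).sum() / n_total - mean_total**2


# quick check against a direct calculation on fake data
rng = np.random.default_rng(0)
chunks = [rng.normal(size=n) for n in (5, 8, 3)]
pooled = np.concatenate(chunks)
counts = [len(c) for c in chunks]
means = [c.mean() for c in chunks]
variances = [c.var() for c in chunks]  # population variance (ddof=0)
assert np.isclose(combine_variances(counts, means, variances), pooled.var())
```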