Review how missing data is handled in inversions

brendan-m-murphy commented 2 months ago

If no footprint or obs data is found for a site, that site is dropped. This works well for monthly inversions, but less so for annual inversions.

@joe-pitt had some runs where NaNs due to missing footprint data were propegated to H and caused a confusing pymc error.

Possible improvements could include:

adding options to fill or drop NaN values
modifying the code that drops sites when data isn't found so that it works for annual runs (might be more difficult?)

Also, we should investigate what happens if some obs data is missing. Typically footprint data won't be missing, but we often have missing obs, and the runs work despite this.

joe-pitt commented 2 months ago

Quick update on this - if a month of fp data from the middle of the dataset is missing from the object store (e.g. you have April 2016 and June 2016, but you are missing May 2016) you get the confusing pymc error: pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!

If there is a month of fp data missing at the end of the dataset (e.g. MHD obs run to November 2023, but the MHD footprints only run to October 2023) you get a different error:

File "/user/home/zh21490/openghg_inversions/openghg_inversions/hbmcmc/inversionsetup.py", line 100, in addaveragingerror
    fp_all[site.upper()].mf_repeatability.values ** 2
ValueError: operands could not be broadcast together with shapes (870,) (1,943)

In this case there are no nans in fp_all, but the timeseries is truncated to the end of the fp data, which causes these subsequent issues. I think the first step would be to raise a more informative error when either of these cases occur, e.g. "Some observations could not be matched to footprints - check that all footprints are loaded in object store". Even better if it can tell you which footprints are missing!

brendan-m-murphy commented 2 months ago

If the error happens during "add averaging error", then that might be because the obs data is loaded again from the object store for this part, so probably obs were dropped when the ModelScenarios are made, and then we try to add some averaging error based on the full obs time series.

joe-pitt commented 2 months ago

Yes I think that is what is happening. So maybe it would run if averaging error was turned off, but by default I think we probably want it to fail, or at the very least give us a strong warning that some of the obs were removed due to missing fps

brendan-m-murphy commented 2 months ago

For now, you might just want to turn off "addaveragingerror", because when it loads the obs, it specifies an averaging period, so the averaging error of that data (over the same period) is zero... so I don't think it does anything. I haven't got around to fixing that, but there is an issue open https://github.com/openghg/openghg_inversions/issues/42

joe-pitt commented 2 months ago

OK cool thanks! Although in my case I found the missing fps so I'm running fine now

qq23840 commented 2 months ago

I'm having this issue as well @brendan-m-murphy : it looks like when I have times that have footprints/no obs, no problem but if there's obs/no footprints I get the same issue as joe (pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!). Ideally we'd want a way of skipping the timestamps without fp in the inversion (presuming this is what happens when there are no obs?). Any idea where that would be best?

Looks related to #96 with the logp -inf issue, as well

brendan-m-murphy commented 2 months ago

It should be possible to fix this using 'dropna("time")' on the model scenarios before they're converted to numpy. I'll have a look Monday.

openghg / openghg_inversions

Review how missing data is handled in inversions #97