openghg / openghg

A cloud platform for greenhouse gas (GHG) data analysis and collaboration.
https://www.openghg.org
Apache License 2.0

get_obs_surface Returning Empty Data Due to Unused Variables with NaN Values #1121

Open hdelongueville opened 2 months ago

hdelongueville commented 2 months ago

What is your issue?

The function get_obs_surface is returning data with no values in it. This issue is caused by the presence of variables filled with NaN values, which are not used in the process.

For example, in the case where there is a variable sttb full of NaNs, all the values are dropped because of that.

A quick fix is to set keep_missing=True, to skip the step that drops the NaNs. What is the best long term solution though?
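
A minimal sketch of the behaviour, using a synthetic dataset rather than real openghg output (the values are invented; only the variable names mf and sttb come from this thread):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data: "mf" is fully populated, "sttb" is entirely NaN.
time = pd.date_range("2020-01-01", periods=4, freq="D")
ds = xr.Dataset(
    {
        "mf": ("time", [1.0, 2.0, 3.0, 4.0]),
        "sttb": ("time", np.full(4, np.nan)),
    },
    coords={"time": time},
)

# dropna checks every data variable by default, so the all-NaN
# "sttb" variable causes every time point to be dropped.
dropped = ds.dropna(dim="time")
print(dropped.sizes["time"])  # 0
```

Even though mf has no missing values here, nothing survives the dropna step.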

joe-pitt commented 2 months ago

Here's a suggestion for dealing with this within inversions:

  1. Add the capability to use quantities other than mf_repeatability in the model-data mismatch uncertainty calculation (see issue #1120)
  2. Drop any variables that aren't being used before dropping the NaNs (within the inversion code)
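
As a rough illustration of the second point, assuming the inversion only needs mf and mf_repeatability (the used_vars list and the data below are made up for this sketch):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=3, freq="D")
ds = xr.Dataset(
    {
        "mf": ("time", [1.0, np.nan, 3.0]),
        "mf_repeatability": ("time", [0.1, 0.1, 0.1]),
        "sttb": ("time", [np.nan, np.nan, np.nan]),  # unused, all NaN
    },
    coords={"time": time},
)

# Assumption: these are the only variables the inversion actually uses.
used_vars = ["mf", "mf_repeatability"]

# Select the used variables first, then drop time points with NaNs.
cleaned = ds[used_vars].dropna(dim="time")
print(cleaned.sizes["time"])  # 2
```

Only the time point where mf itself is NaN is removed; the unused sttb variable no longer affects the result.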

rt17603 commented 1 month ago

From the openghg get_obs_surface function this is the relevant functionality:

        # Resampling may introduce NaNs, so remove, if not keep_missing
        if keep_missing is False:
            ds_resampled = ds_resampled.dropna(dim="time")

https://github.com/openghg/openghg/blob/devel/openghg/retrieve/_access.py#L407

So at the moment this drops a time point if any data variable has a NaN value there. To help with this, we could pass a subset argument to dropna listing the specific variables to check, e.g. something like:

check_nan_subset = ["mf", "mf_repeatability", "mf_variability", ...]
...
        ds_resampled = ds_resampled.dropna(dim="time", subset=check_nan_subset)

And we could make it so that subset could be specified by the user with some useful default as shown above.

The question would be: what defaults would be sensible for this and cover enough cases?
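
As a rough sketch of how a user-specifiable subset with a default might look (the helper name drop_missing_times and the default of ("mf",) are hypothetical and not part of openghg; the data is synthetic):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical default: only require the mole fraction itself.
DEFAULT_NAN_SUBSET = ("mf",)


def drop_missing_times(ds, subset=None):
    """Drop time points where any variable named in `subset` is NaN."""
    if subset is None:
        subset = DEFAULT_NAN_SUBSET
    # Only check variables that exist in this dataset, so that a default
    # listing optional variables does not fail when one is absent.
    present = [name for name in subset if name in ds.data_vars]
    if not present:
        return ds
    return ds.dropna(dim="time", subset=present)


time = pd.date_range("2020-01-01", periods=4, freq="D")
ds = xr.Dataset(
    {
        "mf": ("time", [1.0, np.nan, 3.0, 4.0]),
        "mf_variability": ("time", [0.1, 0.2, np.nan, 0.4]),
        "sttb": ("time", np.full(4, np.nan)),
    },
    coords={"time": time},
)

default_result = drop_missing_times(ds)
strict_result = drop_missing_times(ds, subset=["mf", "mf_variability"])
print(default_result.sizes["time"])  # 3
print(strict_result.sizes["time"])   # 2
```

Filtering the subset down to variables that are present is one possible design choice; raising an error for unknown names would be an equally reasonable alternative.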

rt17603 commented 1 month ago

The function get_obs_surface is returning data with no values in it.

From this it also sounds like there's a second issue around get_obs_surface returning empty data, which may not be helpful. As an alternative, is this something that could be checked for, with an error raised rather than returning the data? Would that be preferable?
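
One way such a check might look, as a sketch only (the function name dropna_or_raise, the use of ValueError, and the message wording are assumptions for illustration, not existing openghg behaviour):

```python
import numpy as np
import pandas as pd
import xarray as xr


def dropna_or_raise(ds, dim="time"):
    """Drop NaNs along `dim`, raising if that removes every point."""
    cleaned = ds.dropna(dim=dim)
    if ds.sizes[dim] > 0 and cleaned.sizes[dim] == 0:
        # Report which variables contain no valid values at all,
        # since these are the most likely cause of the empty result.
        all_nan = [
            str(name)
            for name, var in ds.data_vars.items()
            if bool(var.isnull().all())
        ]
        raise ValueError(
            f"No data left after dropping NaNs along '{dim}'. "
            f"Variables containing only NaN: {all_nan}"
        )
    return cleaned


time = pd.date_range("2020-01-01", periods=3, freq="D")
ds = xr.Dataset(
    {
        "mf": ("time", [1.0, 2.0, 3.0]),
        "sttb": ("time", np.full(3, np.nan)),
    },
    coords={"time": time},
)

try:
    dropna_or_raise(ds)
    message = ""
except ValueError as err:
    message = str(err)
print(message)
```

Naming the all-NaN variables in the message would point the user straight at the cause instead of leaving them with an empty dataset.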

joe-pitt commented 1 month ago

check_nan_subset = ["mf", "mf_repeatability", "mf_variability", ...]
...
        ds_resampled = ds_resampled.dropna(dim="time", subset=check_nan_subset)

And we could make it so that subset could be specified by the user with some useful default as shown above.

The question would be: what defaults would be sensible for this and cover enough cases?

This sounds like a good plan to me. The default should definitely include mf - maybe that is the only one that is absolutely essential?

rt17603 commented 1 month ago

Does anyone have a quick example of the data that caused this problem, by the way? It would be useful to have so that we can add a check for this.

rt17603 commented 1 month ago

Would it be possible to get an example of this data? Would be useful to allow checks to be added for these issues.

hdelongueville commented 1 month ago

I currently only have access to the object store. @joe-pitt, do you think this might be something you could help with?

joe-pitt commented 1 month ago

An example would be the files in: /group/chem/acrg/obs_raw/EYE-AVE-PAR/EYE-AVE-PAR_2.2. Many of these have NaNs for things like LTR, STTB and Unc_n2o. At the moment the icos standardise function only loads unc_n2o (at one point I experimented with adding the others, hence sttb is mentioned in the original post). See also this issue: https://github.com/openghg/openghg_inversions/issues/212