openghg / openghg

A cloud platform for greenhouse gas (GHG) data analysis and collaboration.
https://www.openghg.org
Apache License 2.0

standardisation of footprints in new zarr store can't handle varying inputs #970

Open qq23840 opened 3 months ago

qq23840 commented 3 months ago

What happened?

Trying to standardise some GSN footprints for 2008-2022 into a zarr object store. The raw netCDF files have a slightly different set of variables for 2020-2022 (they contain mean_age_particles_n-type variables, which the pre-2020 footprints I have don't). When I try to standardise these two types of footprint into the same zarr object store, I get an xarray ValueError about dimension sizes at the point where the second set of footprints is written to the store.
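In plain xarray/zarr terms, the shape of the problem is roughly the sketch below (made-up grid and values, not the actual OpenGHG standardisation path):

```python
# Rough illustration in plain xarray/zarr terms, not the actual OpenGHG code
# path. Grid and values are made up; the variable names match my files.
import numpy as np
import pandas as pd
import xarray as xr

lat = np.linspace(-40.0, -35.0, 5)
lon = np.linspace(170.0, 175.0, 5)

def make_footprint(start: str, with_particle_age: bool) -> xr.Dataset:
    time = pd.date_range(start, periods=3, freq="MS")
    data_vars = {"fp": (("time", "lat", "lon"), np.random.rand(3, 5, 5))}
    if with_particle_age:
        # The 2020-2022 files carry extra mean_age_particles_n-type variables
        data_vars["mean_age_particles_n"] = (
            ("time", "lat", "lon"),
            np.random.rand(3, 5, 5),
        )
    return xr.Dataset(data_vars, coords={"time": time, "lat": lat, "lon": lon})

# Pre-2020 footprints (no particle-age variables) go into the store first
make_footprint("2019-01-01", with_particle_age=False).to_zarr("fp.zarr", mode="w")

# Appending data whose variable set differs from what's already in the store
# is the step that fails for me, with a ValueError about dimension sizes
make_footprint("2020-01-01", with_particle_age=True).to_zarr(
    "fp.zarr", mode="a", append_dim="time"
)
```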

What did you expect to happen?

In the old object stores I could standardise these files alongside each other, but it doesn't work in the current setup. I can get around it by dropping the offending variables first, but obviously this is just a workaround.
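The workaround is essentially just something like this before standardising (the filename here is illustrative):

```python
# Workaround sketch: drop the extra particle-age variables before standardising.
# The filename is illustrative.
import xarray as xr

ds = xr.open_dataset("GSN_footprint_2021.nc")
extra_vars = [v for v in ds.data_vars if str(v).startswith("mean_age_particles")]
ds = ds.drop_vars(extra_vars)
ds.to_netcdf("GSN_footprint_2021_trimmed.nc")
```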

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

Python 3.11.7, OpenGHG v0.8.0 (devel branch)

qq23840 commented 3 months ago

The quick fix of dropping the extra variables also throws an error when using the get_footprint function: without the start_date and end_date arguments it appears to work fine, but passing them gives a KeyError for a Timestamp which I can't decipher.

qq23840 commented 3 months ago

Update - @gareth-j made a branch (benFix1) in which the time slicing doesn't happen in the openghg framework. This works fine for the get_footprint functionality, but the openghg_inversions code needs some tweaks to time-slice the data appropriately when running an inversion (openghg_inversions.get_data.py, currently on a local branch).
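For context, the "time slicing" that moves into openghg_inversions is just the usual xarray selection on the retrieved data, along the lines of the sketch below (a rough idea only, not the actual benFix1 or local branch code):

```python
# Rough idea only, not the actual branch code: slice the retrieved footprint
# data to the inversion period after retrieval, rather than inside openghg.
import pandas as pd
import xarray as xr

def slice_to_period(ds: xr.Dataset, start_date: str, end_date: str) -> xr.Dataset:
    return ds.sel(time=slice(pd.Timestamp(start_date), pd.Timestamp(end_date)))
```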

brendan-m-murphy commented 3 months ago

I'm not sure there's an easy way to do this, since the difference is in the data variables rather than the metadata. My first thought is that these should probably be stored as two different datasources, or at least in two separate zarr stores. In the old system every year of data was stored independently, but with zarr we're assuming that the data variables and coordinates are known. It's also possible that the new variables aren't being chunked properly, since the chunk sizes are based on the data already in the zarr store.
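In plain xarray/zarr terms (not the OpenGHG datasource machinery), the two-stores idea would look roughly like the sketch below, with the mismatched variables handled when combining on read:

```python
# Sketch only, plain xarray/zarr rather than the OpenGHG datasource machinery:
# keep the two variable layouts in separate stores and combine them on read.
import numpy as np
import pandas as pd
import xarray as xr

coords = {
    "time": pd.date_range("2019-01-01", periods=2),
    "lat": [0.0, 1.0],
    "lon": [0.0, 1.0],
}
pre_2020 = xr.Dataset(
    {"fp": (("time", "lat", "lon"), np.random.rand(2, 2, 2))}, coords=coords
)
from_2020 = pre_2020.assign_coords(time=pd.date_range("2020-01-01", periods=2))
from_2020["mean_age_particles_n"] = (("time", "lat", "lon"), np.random.rand(2, 2, 2))

pre_2020.to_zarr("fp_pre2020.zarr", mode="w")
from_2020.to_zarr("fp_from2020.zarr", mode="w")

# xr.concat should NaN-fill variables that are missing from one of the stores
combined = xr.concat(
    [xr.open_zarr("fp_pre2020.zarr"), xr.open_zarr("fp_from2020.zarr")], dim="time"
)
```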

Are these new variables, or just different variable names? (I thought NAME footprints typically have mean particle age variables. If the species is inert, you don't need them though.)

qq23840 commented 3 months ago

They are new variables, I think, unless I'm looking at an old set of footprints, which is possible. I've taken them all from the shared area on bp1, but I'm not 100% sure of their status. In this set the pre-2020 footprints don't have the variables at all, whereas the 2020-2022 ones do. It's true that they're not needed for inert species, though, and that's the route I've gone down for a rough-and-ready fix.

gareth-j commented 2 months ago

Check the current Dataset in the zarr store for the existing variables. If they don't exist then fill in with NaNs.
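One reading of that, as a plain xarray sketch (the function name is illustrative, not OpenGHG code, and it assumes float variables whose dimensions already exist in the incoming data):

```python
# Illustrative sketch only, not OpenGHG code. Assumes float variables and that
# the incoming dataset already has all of the dimensions used by the store.
import numpy as np
import xarray as xr

def fill_missing_variables(new_ds: xr.Dataset, store_path: str) -> xr.Dataset:
    """Add NaN-filled copies of any variables the zarr store has but new_ds lacks."""
    existing = xr.open_zarr(store_path)
    for name, var in existing.data_vars.items():
        if name not in new_ds.data_vars:
            shape = tuple(new_ds.sizes[d] for d in var.dims)
            new_ds[name] = (var.dims, np.full(shape, np.nan))
    return new_ds
```

This only covers one direction of the check; the opposite case (the incoming data has variables the store doesn't know about yet, as with the 2020-2022 GSN files if the pre-2020 data went in first) would presumably need the store itself to be extended.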