Porting functionality

znicholls commented 3 years ago

Hi there,

I'm the developer of a package called netCDF-SCM. In short, the package does some of what CMIP6 pre-processing aims at, does some things which CMIP6 pre-processing doesn't do, and doesn't even try to do some of what you're tackling. We wrote a paper about it and the outputs here if it's of interest.

As you can tell if you look at the commit logs, we're not exactly actively maintaining netCDF-SCM. My question to you: if there is a plan to keep working on CMIP6 pre-processing over the long term (say, years), how would you feel about porting some of the functionality from netCDF-SCM into CMIP6 pre-processing? It would take some effort (when either of us finds time, who knows), but there could be some useful shared learnings and less code duplication as a result.

Cheers and congrats on a great package!

jbusecke commented 3 years ago

Hi @znicholls, thank you so much for pinging me about this.

Let me parse out some of the questions:

> I'm the developer of a package called netCDF-SCM. In short, the package does some of what CMIP6 pre-processing aims at, does some things which CMIP6 pre-processing doesn't do, and doesn't even try to do some of what you're tackling. We wrote a paper about it and the outputs here if it's of interest.

This looks really cool, thanks for sharing!

> My question to you: if there is a plan to keep working on CMIP6 pre-processing over the long term (say, years)

I am planning to actively maintain cmip6_preprocessing for the foreseeable future, but I could really use some more eyes on the code, so that it does not completely consume my time hehe. If you are willing to put in some cycles, that would be amazing.

> how would you feel about porting some of the functionality from netCDF-SCM into CMIP6 pre-processing?

In general, I am very open to this idea, but I think we should very clearly define what functionality should/could be ported. I am trying to avoid feature creep by not writing a 'one-does-all' tool but instead relying on a combination of specialized tools and tutorials/documentation on how to use them together. For example, I am planning to rely more heavily on cf-xarray (#59), xgcm (for more complex c-grid analysis) and daops (for a more general approach to "fixes").

From a quick browse, I see the most overlap in the 'drift removal'/'normalization' functionality. I think implementing the rolling mean approach as another option (to a linear trend) could be a very useful starting point?

Another aspect of netCDF-SCM that seems very interesting to me is the functionality to check for errata issues and generate provenance metadata. I would very much like to understand more about what you are doing in that regard.

What functionality would be on your top list to port? Could we make a list of features like this:

| Only in cmip6_pp | Overlap | Only in netCDF-SCM |
| --- | --- | --- |
| Fix naming? | Remove control drift | Errata detection |
| Create ocean basin maps (using regionmask) | Dealing with cell metrics like area, depth, etc. | Conversion to other data formats |
| | | Detecting retractions |
| | | ... |

and mark features we wish to port?

One of my main concerns regarding netCDF-SCM is that it is based on iris, while cmip6_pp is firmly based on xarray. I have zero experience with iris and think that xarray should stay at the core of cmip6_pp.

This was a long way of saying yes I guess 🤪.

znicholls commented 3 years ago

> I think we should very clearly define what functionality should/could be ported

I completely agree.

> I am trying to avoid feature creep by not writing a 'one-does-all' tool but instead relying on a combination of specialized tools and tutorials/documentation on how to use them together. For example, I am planning to rely more heavily on cf-xarray (#59), xgcm (for more complex c-grid analysis) and daops (for a more general approach to "fixes").

Yep this is awesome. I really felt like there must be a better way than writing so much stuff myself (unfortunately when I started zarr files weren't a thing and most other tools hadn't even started).

> I think implementing the rolling mean approach as another option (to a linear trend) could be a very useful starting point?

Sure: the idea isn't super complicated, but there's a bit of learning that's happened along the way which can be helpful.
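To make the two options concrete, here is a minimal sketch of linear-trend versus rolling-mean drift removal, written with plain NumPy rather than taken from netCDF-SCM or cmip6_pp. The function names, the `window` default and the `anomalies` flag are hypothetical, and the control (piControl) series is assumed to be already aligned in time with the experiment at its branch point.

```python
import numpy as np


def linear_drift(control):
    """Least-squares straight line fitted to a 1-D control time series."""
    t = np.arange(control.size)
    slope, intercept = np.polyfit(t, control, deg=1)
    return slope * t + intercept


def rolling_mean_drift(control, window):
    """Centred rolling mean of the control series.

    Near the edges the window shrinks, so the first and last
    ``window // 2`` points are averaged over fewer values.
    """
    half = window // 2
    out = np.empty(control.size, dtype=float)
    for i in range(control.size):
        lo, hi = max(0, i - half), min(control.size, i + half + 1)
        out[i] = control[lo:hi].mean()
    return out


def remove_drift(experiment, control, method="linear", window=21, anomalies=False):
    """Subtract an estimate of the control drift from the experiment.

    anomalies=False keeps absolute values (only the drift about its
    time mean is removed); anomalies=True subtracts the full drift
    estimate, turning the result into perturbations from the control.
    """
    experiment = np.asarray(experiment, dtype=float)
    control = np.asarray(control, dtype=float)
    if experiment.shape != control.shape:
        raise ValueError("experiment and control must be time-aligned, same length")
    if method == "linear":
        drift = linear_drift(control)
    elif method == "rolling":
        drift = rolling_mean_drift(control, window)
    else:
        raise ValueError(f"unknown method: {method}")
    return experiment - (drift if anomalies else drift - drift.mean())
```

The edge handling of the rolling window is one of the places where choices differ between implementations; the shrinking window used here is only one possibility.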

> Another aspect of netCDF-SCM that seems very interesting to me is the functionality to check for errata issues and generate provenance metadata. I would very much like to understand more about what you are doing in that regard.

This is also fairly basic: the idea is that a) you can fairly easily query the ESGF to work out whether a dataset has been marked as retracted (so you don't use dud data in analysis) and b) using ESGF metadata, you can generate citation tables in an automated way (which is handy for papers). It might be the simplest thing to port, but it might also be out of scope for your package (I don't really know where it best sits).
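As a rough illustration of part a), the sketch below builds an ESGF search query for one dataset and interprets a Solr-style JSON reply. It is not netCDF-SCM's code. The endpoint URL, the `retracted` and `instance_id` parameters and the response layout are assumptions that should be checked against the ESGF search API documentation for the index node in use, and the dataset identifier is made up.

```python
from urllib.parse import urlencode

# Assumed search endpoint; ESGF index nodes differ and change over time.
ESGF_SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"


def retraction_query_url(instance_id, base_url=ESGF_SEARCH_URL):
    """Build a search URL that only matches the dataset if it is retracted."""
    params = {
        "type": "Dataset",
        "instance_id": instance_id,
        "retracted": "true",  # assumed facet name for retracted datasets
        "format": "application/solr+json",
        "limit": 1,
    }
    return f"{base_url}?{urlencode(params)}"


def is_retracted(solr_json):
    """Interpret a Solr-style reply: any hit means the dataset is retracted."""
    return solr_json["response"]["numFound"] > 0


# Hypothetical dataset identifier, for illustration only.
example_id = "CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL.historical.r1i1p1f1.Amon.tas.gn.v20200101"
url = retraction_query_url(example_id)
```

Fetching `url` and passing the decoded JSON to `is_retracted` is left out so that the sketch runs without network access.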

> One of my main concerns regarding netCDF-SCM is that it is based on iris, while cmip6_pp is firmly based on xarray. I have zero experience with iris and think that xarray should stay at the core of cmip6_pp.

A good concern, but one that I'm hoping won't be an issue. Most of the ideas in netCDF-SCM are pretty basic so writing the same thing in xarray instead would be pretty simple I think (I've played around with xarray a bit and haven't found any major issues yet). The value of netCDF-SCM is more in the learning, edge case handling and the tests rather than the implementation. (In the worst case that some piece of iris functionality is vital (I think it's very unlikely, netCDF-SCM doesn't do anything very complicated), we can always make it an optional dependency and convert to iris, do the operation, then convert back.)

> What functionality would be on your top list to port?

To be honest, at the moment I just wanted to get an idea of scope and possibilities. Sadly I have no capacity to contribute actual code right now. Given that, I think I see the following options for moving forward:

1. close this issue and I'll re-open if/when I have any time
2. make the table you suggest so we at least have an overview of what would be useful to bring in and what is handled elsewhere
3. just leave the issue open, and we pick things up if/when we can

jbusecke commented 3 years ago

I think I would favor 2. and keeping this open for future reference. We could also add this to the docs, so that other folks are aware of netCDF-SCM?

znicholls commented 3 years ago

Sounds great, my attempt at the table below (maybe easier to do this in a PR so we can more easily build on each other's tables?).

| Only in cmip6_pp | Overlap | Only in netCDF-SCM |
| --- | --- | --- |
| Dealing with volume cell metrics | Naming fixes (although much better handled in cmip6_pp, latest stalled work in netCDF-SCM on branch time issues (I don't know the extent to which these issues are handled by cmip6_pp) is here) | Conversion to other data formats (and should probably stay that way, it's a very SCM focussed bit of functionality which could easily be wrapped around cmip6_pp in future) |
| Ocean basin masking | Remove piControl drift | Region masking e.g. AR6 region masking or country masking (also using regionmask plus natural earth, in theory netCDF-SCM could be fairly easily expanded to do ocean basins but that functionality isn't there at present, netCDF-SCM docs are here) |
| | Dealing with area cell metrics (although there are some tricks to this depending on context which are probably handled differently in the two packages, see docstring of `cell_weights` argument here or Section 2.2.1 of the paper) | Detecting retractions and basic data license checks |
| | Batch processing large numbers of files | Creating citation tables for papers |
| | Decoding data reference syntax(s) into meaningful names (not sure how cmip6_pp handles this but there must be something in there? netCDF-SCM can handle CMIP5 and CMIP6, probably only cmip6_pp can handle e.g. CORDEX or Obs4MIPs) | Calculating and removing piControl rolling-mean (also calculating anomalies against piControl rather than just removing linear trend i.e. turning absolute values into perturbations?) |
| | | Handling of input4MIPs oddities (super niche so probably not worth porting, brief docs here) |
| | | Joining experiments in the same family into a single timeseries (maybe also in cmip6_pp?) |

znicholls commented 2 years ago

Hey @jbusecke, I wanted to reach out again as I am going to get back into some of this work over the next few months. I was wondering if you had any strong thoughts on a way forward? My impression is that:

- cmip6_preprocessing is the right place for dealing with renaming issues (via the dependency packages) and dedrifting
- the decisions about how to deal with cell areas should be left up to individuals as these depend on what someone is trying to do (having multiple packages which make different decisions about what cell areas to use is fine, rather than trying to unify all such assumptions)
- the rest of the stuff in netCDF-SCM can either stay where it is or be turned into standalone packages where appropriate.

One other question: is daops the package which actually does the fixes or are they buried elsewhere? I was trying to find a list of fixes to see how many of the CMIP6 issues I've found are already covered (to get a sense of scope) but I really struggled to work out where to look.

jbusecke commented 2 years ago

Hey @znicholls,

I am very sorry for not responding here. Things were hectic to say the least.

I hope these answers still make some sense.

> cmip6_preprocessing is the right place for dealing with renaming issues (via the dependency packages) and dedrifting

This sounds great. I am planning to work on the dedrifting again soon (#168), and will ping you in case you have some feedback. I would love it if a future cmip6_pp were able to do the rolling means and polynomial fits (cc @DamienIrving).

> the decisions about how to deal with cell areas should be left up to individuals as these depend on what someone is trying to do (having multiple packages which make different decisions about what cell areas to use is fine, rather than trying to unify all such assumptions)

I would agree with this in the long term. I personally want to get this out of cmip6_pp, at least the core functionality (might keep a thin wrapper for convenience). This is much more of a general issue than just CMIP, so handling this upstream is appropriate.

> the rest of the stuff in netCDF-SCM can either stay where it is or be turned into standalone packages where appropriate.

I have two things that might merit further discussion here: the errata check with automatic reference generation, and "Joining experiments in the same family into a single timeseries (maybe also in cmip6_pp?)".

I just mentioned this in #215: I think that an errata check and the reference builder would be very useful to me and other cmip6_pp users in the future. I have no problem leaving those in netCDF-SCM, if they can be integrated easily into the API/workflow over here. Is there a minimal example for each of these that I can try? I might have missed that earlier.

For the joining: I have actually invested quite a bit of work into the postprocessing module, which allows very general 'combination' of custom sets of datasets. There are a bunch of preset wrappers like `concat_experiments` and `merge_variables`, but the underlying logic is very general and I would like to keep it here, since I have used it very successfully in the past months. It is, however, NOT well (or not at all) documented right now (https://github.com/jbusecke/cmip6_preprocessing/issues/190). You can have a look at what is documented here, and check out the overlapping `concat_experiments`.
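The general idea of combining custom sets of datasets can be sketched in plain Python as "group by matching attributes, then apply a combine function to each group". This is only an illustration of that pattern, not the postprocessing module's implementation: the helper names, the dict-based dataset records and the model names are all hypothetical.

```python
from collections import defaultdict


def group_by_attrs(datasets, ignore=("experiment_id",)):
    """Group records that agree on every attribute except the ignored ones."""
    groups = defaultdict(list)
    for ds in datasets:
        key = tuple(sorted((k, v) for k, v in ds["attrs"].items() if k not in ignore))
        groups[key].append(ds)
    return dict(groups)


def concat_in_time(members):
    """Join the members of one group into a single series, ordered by start time."""
    ordered = sorted(members, key=lambda ds: ds["time"][0])
    return {
        "time": [t for ds in ordered for t in ds["time"]],
        "values": [v for ds in ordered for v in ds["values"]],
        "experiments": [ds["attrs"]["experiment_id"] for ds in ordered],
    }


# Made-up records standing in for opened datasets.
datasets = [
    {"attrs": {"source_id": "MODEL-A", "member_id": "r1i1p1f1", "experiment_id": "ssp585"},
     "time": [2015, 2016], "values": [1.2, 1.3]},
    {"attrs": {"source_id": "MODEL-A", "member_id": "r1i1p1f1", "experiment_id": "historical"},
     "time": [2013, 2014], "values": [1.0, 1.1]},
    {"attrs": {"source_id": "MODEL-B", "member_id": "r1i1p1f1", "experiment_id": "historical"},
     "time": [2013, 2014], "values": [0.9, 1.0]},
]

joined = {key: concat_in_time(members) for key, members in group_by_attrs(datasets).items()}
```

Swapping `concat_in_time` for a different combine function (for example one that merges variables instead of joining in time) reuses the same grouping step.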

> One other question: is daops the package which actually does the fixes or are they buried elsewhere? I was trying to find a list of fixes to see how many of the CMIP6 issues I've found are already covered (to get a sense of scope) but I really struggled to work out where to look.

As far as I know there are no actual fixes for CMIP6 implemented over at daops, that would be something I would have to start probably haha. If you would like to be involved that would be awesome. cc @agstephens

Thanks for keeping me informed about your plans. I hope I have more bandwidth going forward. If you think a 1:1 chat would help to discuss this further, I can definitely hop on a zoom some time soon.

znicholls commented 2 years ago

Hi @jbusecke,

> I am very sorry for not responding here. Things were hectic to say the least.

No issue at all, hope they've calmed down a bit. As you can probably tell, I'm not moving super fast on this at the moment either but I am starting to give it more thought now.

I've just started looking at intake-esm, which helpfully points right back here for pre-processing to deal with issues (https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html#preprocessing-the-cmip6-datasets). Perhaps the simplest way forward then is that, as I start working with CMIP6 data again, I make issues with specific examples of pre-processing I would like to do/use cases and we can discuss (whether that be a) rtfd, b) that would be cool to have in cmip6_preprocessing, but we don't have it yet (in which case I'll be happy to open a PR), c) that is handled somewhere else, d) I have no idea how to handle that, or e) something else)?

Re errata and references: have a look at https://netcdf-scm.readthedocs.io/en/latest/usage/using-cmip-data.html. Note that it's just 'official' errata, there's no capability for handling e.g. incorrect reporting of units (so wouldn't help with #215, but I'm super interested to see if there is a way to get a solution to #215 into this package)

Re joining etc.: sounds good to keep it here. I'll see if I find any use cases that aren't covered then can happily make a PR.

jbusecke commented 2 years ago

> Perhaps the simplest way forward then is that, as I start working with CMIP6 data again, I make issues with specific examples of pre-processing I would like to do/use cases and we can discuss

Yes that would be very much appreciated!

> I'll see if I find any use cases that aren't covered then can happily make a PR.

That is super awesome. Thankful for any support here.

jbusecke / xMIP

Porting functionality #149