ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)
Creative Commons Zero v1.0 Universal

One code base? #3

Open ChrisBarker-NOAA opened 8 years ago

ChrisBarker-NOAA commented 8 years ago

While, particularly with Python's duck-typing, we can develop pyugrid and pysgrid entirely independently, maybe that's not the best way to go.

In particular, the code for handling time, vertical coordinates, and reading netcdf could/should be the same.

One option is to create libraries for each of these functionalities that both packages use, but it might be easier to simply put them all in one package.

ocefpaf commented 8 years ago

One option is to create libraries for each of these functionalities that both packages use.

I am an advocate for splitting things up. I would love to see these parts that you mentioned as separate packages. However, let me just elaborate a little bit more.

handling time -> I guess that is the easy part. We are good by just converting datetimes and/or plugging dates into pandas/xray/iris. (Note that pandas and xray might choke on some CF-weird-allowed dates, but iris is OK with those. Because of that I think we are better off leaving this decision to the end user.)
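To make the "just converting datetimes" part concrete, here is a minimal sketch using only the standard library. The name decode_cf_time is hypothetical, and it assumes a plain "<unit> since <ISO date>" units string with no time zone suffix and a standard calendar. It is not how pandas, xray, or iris do it.

```python
from datetime import datetime, timedelta

# Only fixed-length units are handled; months and years are ambiguous.
_SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def decode_cf_time(values, units):
    """Convert numeric offsets to datetimes (hypothetical helper).

    `units` is assumed to look like "hours since 2000-01-01 00:00:00".
    """
    unit, _, reference = units.partition(" since ")
    # fromisoformat only accepts dates that exist in the Gregorian calendar.
    origin = datetime.fromisoformat(reference.strip())
    scale = _SECONDS_PER_UNIT[unit.strip().lower()]
    return [origin + timedelta(seconds=value * scale) for value in values]


times = decode_cf_time([0, 6, 24], "hours since 2000-01-01 00:00:00")
```

A reference date that is only valid in a non-standard calendar (for example 2000-02-30 in a 360-day calendar) raises ValueError here, which is the kind of choking mentioned above and one more reason to leave the choice of time library to the end user.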

vertical coordinates -> I can see a very simple module that, given the proper variables, computes Z. The module would contain only functions, each named after a vertical coordinate. The next step would be finding the input variables automatically via @kwilcox's get_variables_by_attributes, which is part of netCDF4 now :wink:, and wrapping all that up in a make_z(ncfile/url) method. If we do things right the functions can be called by iris or xray to create their objects' Z, or called independently to get a numpy array of Z. That way we keep things light and bypass any higher level CF object creation. (The best and worst feature of iris BTW.)
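A rough sketch of what one such function could look like, using the CF ocean_sigma_coordinate formula z(n,k,j,i) = eta(n,j,i) + sigma(k) * (depth(j,i) + eta(n,j,i)). The array shapes and the Z_FUNCTIONS lookup table are assumptions made for illustration, not an existing API.

```python
import numpy as np


def ocean_sigma_coordinate(sigma, eta, depth):
    """Compute z for the CF ocean_sigma_coordinate.

    Assumed shapes: sigma (k,), eta (n, j, i), depth (j, i).
    Returns an array of shape (n, k, j, i).
    """
    sigma = np.asarray(sigma, dtype=float)
    eta = np.asarray(eta, dtype=float)
    depth = np.asarray(depth, dtype=float)
    # Add length-1 axes so everything broadcasts to (n, k, j, i).
    sigma = sigma[np.newaxis, :, np.newaxis, np.newaxis]
    eta = eta[:, np.newaxis, :, :]
    depth = depth[np.newaxis, np.newaxis, :, :]
    return eta + sigma * (depth + eta)


# Hypothetical dispatch table: one function per vertical coordinate,
# keyed by the coordinate's standard_name.
Z_FUNCTIONS = {"ocean_sigma_coordinate": ocean_sigma_coordinate}

z = Z_FUNCTIONS["ocean_sigma_coordinate"](
    sigma=[0.0, -0.5, -1.0],
    eta=[[[0.0]], [[1.0]]],  # two time steps on a 1x1 grid
    depth=[[10.0]],
)
```

Because the function only takes and returns arrays, iris or xray could call it to build their own Z, and anyone else could call it directly.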

reading netcdf -> Here I will contradict myself because I'd love to see a high level CF-object interpretation. Exactly what iris does but without the "cube."

The main problem here is the 1-object to 1-phenomenon mapping (the iris cube) versus the Dataset concept (xray.Dataset). The two do not seem to go together! To me that is a flaw in the CF model. Even when all variables have the same coordinates we cannot create a model, because there is no unique identifier for the phenomenon. standard_name may repeat, and variable names map OK within a single Dataset but are not OK for universal reading.

rsignell-usgs commented 8 years ago

@ocefpaf, when you say: "Even when all variables have the same coordinates we cannot create a model, because there is no unique identifier for the phenomenon. standard_name may repeat, and variable names map OK within a single Dataset but are not OK for universal reading." are you supporting the idea of data variable as the fundamental data object, as Iris currently does?

ocefpaf commented 8 years ago

Are you supporting the idea of data variable as the fundamental data object, as Iris currently does?

Yes. I did not like this at first, but after SciPy I confess that @pelson convinced me.

From the CF document:

   2.5. Variables
   This convention does not standardize variable names.

In my view the problem is that CF has no Dataset concept. To implement a Dataset object we have to be flexible with CF and accept some things, like variable names as keys, as xray does.

rsignell-usgs commented 8 years ago

When you say "the problem is that CF has no Dataset concept", it sounds like you are in favor of a dataset model. But really all that the dataset provides is shared coordinates and shared metadata, which just get propagated down to the variable object anyway when you need to use it.

So I guess I'm saying I'm coming around to the Iris way of doing things. I think we just like datasets because that's what NetCDF files look like and we've been using them for 25 years. And if we want them to keep looking like datasets, we can, because when you save from Iris to netcdf, it looks to see if the cubes have metadata in common, and if they do, it writes those as global attributes.

ocefpaf commented 8 years ago

When you say "the problem is that CF has no Dataset concept", it sounds like you are in favor of a dataset model.

Actually I am more inclined to the no-Dataset model. However, many people are still attached to the Dataset model because it is easy to wrap our minds around the 1-file to 1-dataset mapping. This mapping makes sense for those working with an individual model output, buoy, or mooring. In my case, where I need to compare a certain variable across many different data sources, the 1-object per phenomenon model makes perfect sense.

Not sure if CF had a no-Dataset concept in mind when they created the conventions, nor if people will abandon the Dataset concept easily. (Just note the success that xray has when people see a repr that is closely related to the netCDF4-python repr.)

In essence we will need to loop over the variables at some point; whether they are in a Dataset or in a (Cube)List does not matter.

One issue that comes out of this is saving the variables back into a netCDF4 with the proper global attributes, but I guess that iris has that covered. From iris docs:

   The attributes dictionaries on each cube in the saved cube list will be compared and common    
   attributes saved as NetCDF global attributes where appropriate.
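To illustrate the idea in that quote, here is a minimal sketch of how shared attributes could be lifted out of per-variable attribute dictionaries. The function name split_common_attributes is hypothetical and this is not Iris's implementation, only the same idea in plain Python.

```python
def split_common_attributes(attribute_dicts):
    """Split attribute dicts into (common, remainders).

    `common` holds the key/value pairs present with equal values in every
    dict (candidates for global attributes); `remainders` holds what is
    left for each variable, in the same order as the input.
    """
    if not attribute_dicts:
        return {}, []
    first, *rest = attribute_dicts
    common = {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }
    remainders = [
        {key: value for key, value in attrs.items() if key not in common}
        for attrs in attribute_dicts
    ]
    return common, remainders


temp_attrs = {"institution": "NOAA", "source": "model", "units": "degC"}
salt_attrs = {"institution": "NOAA", "source": "model", "units": "1"}
global_attrs, variable_attrs = split_common_attributes([temp_attrs, salt_attrs])
```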

Iris also has some nice convenience methods on the CubeList, like .merge(), .merge_cube(), .extract() and .extract_overlapping(), that let you work with the CubeList as if it were a Dataset.

ChrisBarker-NOAA commented 8 years ago

I've bounced around with this a lot too. I found the Iris cube concept unnerving at first. Now I'm on the fence...

But I think maybe we can have both -- in a sense, an Iris CubeList (is that a special object, or just a list?) is a DataSet.

So -- could we have a "DataSet" object that holds a bunch of individual data objects, where the individual data objects can be used by themselves (so they know about the grid they are attached to, etc), and you can also work with an entire dataset if you like?

One thing that is annoying with Iris is that a CubeList is a list: i.e. the cubes don't have names, and it's ordered, which does make sense. I understand that the names of variables have no special significance in CF -- but they are guaranteed to be unique, and generally are what people expect to see. So I think preserving and working with the names of data is fine, as long as we provide methods for finding out the name of a variable from a standard_name or whatever.

So a DataSet object might look like a dict.
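Something like this minimal sketch, maybe. The class and method names (Variable, DataSet, names_for) are made up for illustration and do not exist in pyugrid or pysgrid.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical stand-in for a data object; only attributes matter here."""

    attributes: dict = field(default_factory=dict)


class DataSet(dict):
    """Hypothetical dict-like container keyed by variable name.

    Names are unique within one dataset, so they work as keys, while
    standard_name may repeat and therefore maps to a list of names.
    """

    def names_for(self, standard_name):
        return [
            name
            for name, var in self.items()
            if var.attributes.get("standard_name") == standard_name
        ]


ds = DataSet(
    temp=Variable({"standard_name": "sea_water_temperature"}),
    temp_sfc=Variable({"standard_name": "sea_water_temperature"}),
    salt=Variable({"standard_name": "sea_water_salinity"}),
)
names = ds.names_for("sea_water_temperature")
```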

Keep in mind that a data object would need to know about the dataset, for grid info, etc, and vice versa, so we need to be careful about circular references...
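One way to keep that two-way link without a reference cycle is a weak back-reference from the data object to its dataset. A minimal sketch, with hypothetical class names and a made-up grid attribute:

```python
import weakref


class DataSet(dict):
    """Hypothetical container that owns its variables and the grid info."""

    def __init__(self, grid=None):
        super().__init__()
        self.grid = grid

    def add(self, name, variable):
        variable.attach(self)
        self[name] = variable


class Variable:
    """Hypothetical data object holding only a weak link to its dataset."""

    def __init__(self, data):
        self.data = data
        self._dataset_ref = None

    def attach(self, dataset):
        # A weak reference does not keep the dataset alive, so there is
        # no strong dataset -> variable -> dataset cycle.
        self._dataset_ref = weakref.ref(dataset)

    @property
    def dataset(self):
        return None if self._dataset_ref is None else self._dataset_ref()

    @property
    def grid(self):
        dataset = self.dataset
        return None if dataset is None else dataset.grid


dataset = DataSet(grid="ugrid")
temp = Variable([10.0, 11.5])
dataset.add("temp", temp)
```

The trade-off is that a variable used on its own loses its grid once the dataset is gone, so a standalone data object may need its own strong reference to the grid instead.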

This is all not unlike the netCDF4 lib -- you can get a DataSet and you can get its variables, and then work just with a variable, if you like. It'll just have to be smarter...

-CHB

ocefpaf commented 8 years ago

I've bounced around with this a lot too. I found the Iris cube concept unnerving at first. Now I'm on the fence...

It appears that we all have been there :smile:

I understand that the names of variables have no special significance in CF -- but they are guaranteed to be unique.

Unique in a single dataset. When loading and manipulating more than one dataset you will find yourself creating a new dataset (new keys for the dictionary) at every operation (e.g.: {temp: temp_var} will become {model1_temp: temp_var, model2_temp: temp_var}). Not that this is a problem, and we will have to do this for the Cube objects also. I guess it will be an issue only when merging datasets. Not sure how often that happens though.
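A minimal sketch of that renaming, with plain dicts standing in for datasets. The function name merge_datasets and the "label_name" key scheme are only illustrative.

```python
def merge_datasets(**datasets):
    """Merge dict-like datasets, prefixing each key with its dataset label.

    Variable names are only unique inside one dataset, so the label is
    what keeps model1's "temp" apart from model2's "temp".
    """
    merged = {}
    for label, variables in datasets.items():
        for name, variable in variables.items():
            merged[f"{label}_{name}"] = variable
    return merged


merged = merge_datasets(
    model1={"temp": "temp_var_1"},
    model2={"temp": "temp_var_2"},
)
```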

as long as we provide methods for finding out the name of variable from a standard_name or whatever.

That is implemented in iris via the .extract() method. And now we have something similar in the raw netCDF4-python object via the get_variables_by_attributes method (note that there is no compliance checking in get_variables_by_attributes). Xray still lacks a convenience method to find variables via attributes or CF rules.

So a DataSet object might look like a dict.

I asked @pelson about that at SciPy and he explained to me the problem with the CF var names and how that moved them away from the dict model. Maybe he can add more to the discussion here because I do not remember all the reasons behind it. I just remember that they did consider a dict and there were some good reasons for not using it at the time.

This is all not unlike the netCDF4 lib -- you can get a DataSet and you can get its variables, and then work just with a variable, if you like. It'll just have to be smarter...

Shameless plug :stuck_out_tongue_winking_eye: No need to be smarter, just use get_variables_by_attributes [1]

[1] https://ocefpaf.github.io/python4oceanographers/blog/2015/09/21/netcdf4/

ChrisBarker-NOAA commented 8 years ago

""" No need to be smarter, just use get_variables_by_attributes [1] """

yup -- very cool. I, like all of you, had hacked together utilities like that -- thanks to all that took the time to push it upstream!

but we need to make a version that is not dependent on netcdf4 in our DataSet object, anyway -- not hard to do.
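For what it's worth, a library-independent version could be as small as the sketch below. It works on any mapping of names to objects that expose their metadata as Python attributes, and it mimics the netCDF4-python behavior as I understand it (a value is compared for equality, a callable is used as a predicate and receives None when the attribute is missing). Treat it as an assumption-laden sketch, not a drop-in replacement.

```python
from types import SimpleNamespace


def get_variables_by_attributes(variables, **criteria):
    """Return the variables whose attributes satisfy every criterion.

    `variables` is any mapping of name -> object. Each criterion is either
    a value to compare against or a callable predicate.
    """
    matches = []
    for variable in variables.values():
        for attribute, expected in criteria.items():
            actual = getattr(variable, attribute, None)
            if callable(expected):
                satisfied = bool(expected(actual))
            else:
                satisfied = actual == expected
            if not satisfied:
                break
        else:
            # No criterion failed for this variable.
            matches.append(variable)
    return matches


# SimpleNamespace objects stand in for real variables here.
variables = {
    "temp": SimpleNamespace(standard_name="sea_water_temperature", units="degC"),
    "time": SimpleNamespace(axis="T", units="hours since 2000-01-01"),
    "lon": SimpleNamespace(axis="X", units="degrees_east"),
}
time_vars = get_variables_by_attributes(variables, axis="T")
axis_vars = get_variables_by_attributes(
    variables, axis=lambda value: value in ("X", "Y", "Z", "T")
)
```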

As for Iris not using a dict interface -- I still don't get it -- I can't see a downside to having a nice key you can index by -- you wouldn't lose functionality, only gain some....

-CHB

ocefpaf commented 8 years ago

yup -- very cool. I, like all of you, had hacked together utilities like that -- thanks to all that took the time to push it upstream!

You can always ping me and I will do the salmon work of getting your code upstream :wink: (BTW I am working on incorporating your gravity wave code into my own.)

but we need to make a version that is not dependent on netcdf4 in our DataSet object, anyway -- not hard to do.

Easy. Actually, I already did one for xray.

As for Iris not using a dict interface -- I still don't get it -- I can't see a downside to having a nice key you can index by -- you wouldn't lose functionality, only gain some....

Only the iris devs can elaborate more on that. I see a problem with merging only, but that is not a big issue in the light of the advantages.