tobac-project / tobac

Tracking and object-based analysis of clouds
BSD 3-Clause "New" or "Revised" License

Add metadata and description of variables to output files #401

Open JuliaKukulies opened 8 months ago

JuliaKukulies commented 8 months ago

As part of the xarray transition, we should add metadata and variable descriptions to the output files that tobac creates. Part of it can be left to the user (e.g. the user-specific bulk statistics), but for projects like MCSMIP, where tobac data is shared and published, it would be helpful to be able to open the files and see how we define each variable (e.g., what we currently only have listed here).

freemansw1 commented 8 months ago

Entirely agreed. This is a key component of being good citizens of FAIR principles (https://www.go-fair.org/fair-principles/). #354 doesn't necessarily get us all the way there for that; our feature detection output will still be a Pandas DataFrame at the moment, which has frustratingly limited metadata options.

We have a couple options for resolving that issue; we could simply output xarray if users input xarray rather than iris data. The issue there is that our users likely don't have a workflow set up for that xarray data (but they would have to opt into using xarray by changing their workflow anyway). We could also make it an option, and decide down the road whether to disable or make pandas non-default for output.

After #354, but before 1.6.0 releases, I think we should make sure that we have an xarray output option with the appropriate metadata. Perhaps that would be a good topic for the tobathon next week. How we implement it (default or an option) would be a good discussion; I think there are reasonable points on both sides.

Longer-term, we should have options (I think there's another issue for this) to output/combine into a single file, although that gets challenging with how large segmentation output can get.

JuliaKukulies commented 8 months ago

> We have a couple options for resolving that issue; we could simply output xarray if users input xarray rather than iris data. The issue there is that our users likely don't have a workflow set up for that xarray data (but they would have to opt into using xarray by changing their workflow anyway). We could also make it an option, and decide down the road whether to disable or make pandas non-default for output.

I think outputting xarray is the way to go because, as you say, users have to change their workflow for the xarray transition anyhow. And yes, it is frustrating that pandas DataFrames have such limited options for metadata. A question that I think we have not discussed extensively is whether we only want to switch from iris to xarray or also replace all pandas DataFrame operations internally. Pandas DataFrames still have some very useful functionality, so maybe it would make sense to output even the features as xarray but keep pandas internally? I am not sure about this.
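To make the "pandas internally, xarray on output" idea concrete, here is a rough sketch of what the conversion step could look like. It is only an illustration: `features_to_dataset`, `VARIABLE_ATTRS`, and the attribute texts are hypothetical names and placeholder descriptions, not existing tobac API or our agreed variable definitions. The only library call it relies on is `xarray.Dataset.from_dataframe`.

```python
import pandas as pd
import xarray as xr

# Placeholder descriptions for illustration only; the real definitions
# would have to come from the tobac documentation.
VARIABLE_ATTRS = {
    "frame": {"long_name": "index of the time step in the input data"},
    "feature": {"long_name": "unique feature identifier"},
    "hdim_1": {
        "long_name": "feature position along the first horizontal dimension",
        "units": "grid points",
    },
    "hdim_2": {
        "long_name": "feature position along the second horizontal dimension",
        "units": "grid points",
    },
}


def features_to_dataset(features: pd.DataFrame, global_attrs: dict) -> xr.Dataset:
    """Convert a feature DataFrame to an xarray Dataset and attach metadata.

    Hypothetical helper: keeps pandas for internal processing and only
    converts at output time.
    """
    # Name the index so the resulting dimension has a predictable name.
    ds = xr.Dataset.from_dataframe(features.rename_axis("index"))
    # Attach per-variable descriptions where the column is present.
    for name, attrs in VARIABLE_ATTRS.items():
        if name in ds:
            ds[name].attrs.update(attrs)
    # File-level metadata, e.g. title or processing settings.
    ds.attrs.update(global_attrs)
    return ds


# Small made-up feature table standing in for feature detection output.
features = pd.DataFrame(
    {
        "frame": [0, 0, 1],
        "feature": [1, 2, 3],
        "hdim_1": [10.5, 20.0, 11.0],
        "hdim_2": [5.0, 7.5, 5.5],
    }
)
ds = features_to_dataset(features, {"title": "example feature detection output"})
```

Writing `ds` with `ds.to_netcdf(...)` would then carry the descriptions into the file, so someone opening shared data sees the definitions without having to look up the docs.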

> After #354, but before 1.6.0 releases, I think we should make sure that we have an xarray output option with the appropriate metadata. Perhaps that would be a good topic for the tobathon next week. How we implement it (default or an option) would be a good discussion; I think there are reasonable points on both sides.

Good idea, I also thought that this is something we could take up at the tobathon since it would be useful to get input from users who are not currently developers.

> Longer-term, we should have options (I think there's another issue for this) to output/combine into a single file, although that gets challenging with how large segmentation output can get.

Do you mean something like our tobac.utils.combine_feature_dataframes functionality, but more internal, so that users can input a list of files/dataframes for tracking and output them all into a single file?