tomchor / Oceanostics.jl

Diagnostics for Oceananigans
https://tomchor.github.io/Oceanostics.jl/
MIT License
24 stars 9 forks source link

Use `FieldTimeSeries` for analysis? #181

Open glwagner opened 1 week ago

glwagner commented 1 week ago

I think it's a bit nicer to use our built-in solution, since it gels with plotting and enables working with AbstractOperations in post-processing:

https://github.com/tomchor/Oceanostics.jl/blob/f7573d9b0c0c6ce1eeee95fd1482a32c2e569f30/docs/examples/two_dimensional_turbulence.jl#L105-L107

Also nice to be diligent about using it in docs, which encourages users to help contribute to developing its features.

glwagner commented 2 days ago

Just want to bump this and make sure it is read --- I think it's important to use FieldTimeSeries.

tomchor commented 2 days ago

I made the choice to stick with NetCDF because that's how the majority of users (based on the people I know at least) use it to process Oceananigans output and the Oceananigans examples are already all written with JLD2 output (with I think two exceptions only). Once FieldTimeSeries accepts NetCDF, we should switch.

tomchor commented 2 days ago

I know this isn't the place but, out of curiosity, can you explain the motivation behind building an in-house solution from scratch? DimensionalData has very nice capabilities that abstract away indices (much like xarray), with some nice visualizations. I definitely don't want to downplay the development of FieldTimeSeries, but it seems like contributing to that already well-established package would get us more capabilities while benefiting a larger Julia community.

glwagner commented 2 days ago

FieldTimeSeries does not overlap with DimensionalData. DimArray is a lower-level object than a FieldTimeSeries, and in fact we can envision subtyping FieldTimeSeries / Field as an AbstractDimArray. This has the potential to supply array slicing utilities directly in Field and FieldTimeSeries. Then the DimensionalData syntax could be used both for building output writers (eg the indices kwarg in output writers) as well as for post-processing.

Suggesting that we use DimensionalData instead of FieldTimeSeries is similar to suggesting that we use DimensionalData instead of Field. If we did that, we would no longer have AbstractOperations. Is that what we want?

FieldTimeSeries represents a time-series of Fields which can be in or out-of-core. For post-processing, the key power of FieldTimeSeries is that one can use all of the abstract operations, as well as plotting utilities, that are defined for Field. FieldTimeSeries have grids, boundary conditions, indices, and staggered locations, all of which are not within scope for DimensionalData. But more than that, FieldTimeSeries also supports time-interpolation and implements an interface for different data backends -- thereby also supporting forcing models with discrete data from diverse sources, like reanalysis, other model output, etc.

We use FieldTimeSeries to represent lazily-loaded atmospheric states for forcing regional and global simulations in ClimaOcean. We also use FieldTimeSeries to represent other types of data loaded from file, like the ECCO state estimate, for restoring. We will use FieldTimeSeries for open boundary conditions in the future.

It's not really important whether the "majority of users" interface with NetCDF (even if that were true, which I doubt). What matters more is our vision for next-generation ocean modeling software. My argument is that we can build something much more productive and powerful if we recognize that the problems of post-processing and online diagnostics are in fact, exactly the same. The typical approach is to build different systems for those two things. If we build one system, we'll have a lot more with less code and less work.

glwagner commented 2 days ago

I made the choice to stick with NetCDF because that's how the majority of users (based on the people I know at least) use it to process Oceananigans output and the Oceananigans examples are already all written with JLD2 output (with I think two exceptions only).

I think the priority should be to use FieldTimeSeries -- because it's so important --- which also makes it prerogative to support NetCDF with FieldTimeSeries in the even that using NetCDF is important for some reason.

That said, I don't think the output format is very crucial if we are working within Julia. NetCDF has the advantage of being useful outside of Julia, but this is moot for a Julia package (like Oceanostics). On the other hand JLD2 is more lightweight and portable (there are systems that don't even support NetCDF), so JLD2 is absolutely crucial.

tomchor commented 2 days ago

FieldTimeSeries does not overlap with DimensionalData. DimArray is a lower-level object than a FieldTimeSeries, and in fact we can envision subtyping FieldTimeSeries / Field as an AbstractDimArray. This has the potential to supply array slicing utilities directly in Field and FieldTimeSeries. Then the DimensionalData syntax could be used both for building output writers (eg the indices kwarg in output writers) as well as for post-processing.

FieldTimeSeries represents a time-series of Fields which can be in or out-of-core. For post-processing, the key power of FieldTimeSeries is that one can use all of the abstract operations, as well as plotting utilities, that are defined for Field. FieldTimeSeries have grids, boundary conditions, indices, and staggered locations, all of which are not within scope for DimensionalData. But more than that, FieldTimeSeries covers any work with discrete time-series --- including forcing models with discrete data. We use FieldTimeSeries to represent lazily-loaded atmospheric states for forcing regional and global simulations in ClimaOcean. We also use FieldTimeSeries to represent other types of data loaded from file, like the ECCO state estimate, for restoring. We will use FieldTimeSeries for open boundary conditions in the future.

Gotcha, I wasn't aware of that distinction. Thanks for the clarification. I do like the idea of subtyping AbstractDimArray given some of the capabilities of the latter.

It's not really important whether the "majority of users" interface with NetCDF (even if that were true, which I doubt).

I disagree. I think people come to the examples section of the docs as a starting point to write their own scripts. So providing them with something they're more likely to use is helpful. I don't oppose using JLD2 and NetCDF in different examples though. I just don't think we should switch all examples completely.

What matters more is our vision for next-generation ocean modeling software. My argument is that we can build something much more productive and powerful if we recognize that the problems of post-processing and online diagnostics are in fact, exactly the same. The typical approach is to build different systems for those two things. If we build one system, we'll have a lot more with less code and less work.

I agree with that too. The issue in my view is that xarray has a huge head start in solving these issues, so people are more likely to use it.

glwagner commented 2 days ago

I disagree. I think people come to the examples section of the docs as a starting point to write their own scripts. So providing them with something they're more likely to use is helpful.

But following this logic, we could never implement / demonstrate a new feature, because nobody would use it yet.

I think we need to document the way we want things to be done. If users persist in doing things differently, we need to figure out why and make sure we work on features that are useful.

glwagner commented 2 days ago

The issue in my view is that xarray has a huge head start in solving these issues, so people are more likely to use it.

Why is this an issue? I think in principle we can bring xarray features into FieldTimeSeries (this would add a huge dependency but is possible in principle). It's not correct to think of FieldTimeSeries as replacing or supplanting xarray. They are orthogonal.

ali-ramadhan commented 1 day ago

I'm also a huge proponent for NetCDF for multiple reasons (even for fully Julian workflows) so I'm interested in making FieldTimeSeries work with NetCDF output.

I'd also love some of the niceties of xarray, DimensionalData.jl, and RasterStack. I do like the idea of opening issues for features we'd like.

Maybe we can work together on this, and resolve this issue in the process?

https://github.com/CliMA/Oceananigans.jl/pull/2652 may be the next step. Then I was thinking of opening a PR to modernize or clean up NetCDFOutputWriter (and make it more flexible). Then a PR to make it work with FieldTimeSeries.

glwagner commented 1 day ago

I think NetCDF has important applications, despite some downsides like worse performance and reduce portability. The most important deficiency is that calculations from the simulation cannot be reproduced when saving in NetCDF because Field cannot be reconstructed. I think this is a major risk of saving in NetCDF; even though it may be convenient to analyze with xarray, the price of ruling out precise analysis has a major scientific cost. It's a lot to pay for convenience.

Performance and portability have to be solved elsewhere but we can do something about this reproducibility problem. We need to make it easier to rebuild native Oceananigans types from NetCDF data. This could be part of an effort to support FieldTimeSeries. I agree that the ability to reconstruct the Oceananigans grid is probably the core issue. There are some auxiliary ones too, such as saving location, indices.

As for DimensionalData / RasterStack, they seem convenient so it would be nice to leverage them. It seems like the right way to do this is to make AbstractField subtype AbstractDimArray, rather than merely AbstractArray as it currently does.

tomchor commented 1 day ago

Maybe we can work together on this, and resolve this issue in the process?

CliMA/Oceananigans.jl#2652 may be the next step. Then I was thinking of opening a PR to modernize or clean up NetCDFOutputWriter (and make it more flexible). Then a PR to make it work with FieldTimeSeries.

I like this plan. Although https://github.com/CliMA/Oceananigans.jl/pull/2652 is old enough that it might be preferable to start a new PR. And if that's true it might be preferable to start with modernizing the NetCDFWriter. I'll start by checking https://github.com/CliMA/Oceananigans.jl/pull/2652 and then we can discuss.

ali-ramadhan commented 1 day ago

I agree that the ability to reconstruct the Oceananigans grid is probably the core issue. There are some auxiliary ones too, such as saving location, indices.

Yeah I don't think it'll be pretty, but we should be able to save enough metadata in the NetCDF file to do this.

It seems like the right way to do this is to make AbstractField subtype AbstractDimArray, rather than merely AbstractArray as it currently does.

I know this came up in the past but it sounded like it would be too much work and refactoring? I could be wrong. If we just want a few features then it could be easier to implement just those?

Although https://github.com/CliMA/Oceananigans.jl/pull/2652 is old enough that it might be preferable to start a new PR. And if that's true it might be preferable to start with modernizing the NetCDFWriter.

Sounds good! I don't know if we should open a new issue to discuss but yeah I do have my own list of NetCDFOutputWriter grips haha.

glwagner commented 1 day ago

I know this came up in the past but it sounded like it would be too much work and refactoring? I could be wrong. If we just want a few features then it could be easier to implement just those?

I can't remember. But of course, there is always a trade-off between cost/work and benefit gained. If people are willing to absorb the huge penalty of going to xarray for analysis in order to get slicing then to me that suggests there is a pretty significant benefit. That assumes that the penalties are understood and rational decisions are made of course...