Open glwagner opened 1 week ago
Just want to bump this and make sure it is read --- I think it's important to use FieldTimeSeries
.
I made the choice to stick with NetCDF because that's how the majority of users (based on the people I know at least) use it to process Oceananigans output and the Oceananigans examples are already all written with JLD2 output (with I think two exceptions only). Once FieldTimeSeries
accepts NetCDF, we should switch.
I know this isn't the place but, out of curiosity, can you explain the motivation behind building an in-house solution from scratch? DimensionalData has very nice capabilities that abstract away indices (much like xarray), with some nice visualizations. I definitely don't want to downplay the development of FieldTimeSeries
, but it seems like contributing to that already well-established package would get us more capabilities while benefiting a larger Julia community.
FieldTimeSeries
does not overlap with DimensionalData
. DimArray
is a lower-level object than a FieldTimeSeries
, and in fact we can envision subtyping FieldTimeSeries
/ Field
as an AbstractDimArray
. This has the potential to supply array slicing utilities directly in Field
and FieldTimeSeries
. Then the DimensionalData
syntax could be used both for building output writers (eg the indices
kwarg in output writers) as well as for post-processing.
Suggesting that we use DimensionalData
instead of FieldTimeSeries
is similar to suggesting that we use DimensionalData
instead of Field
. If we did that, we would no longer have AbstractOperations
. Is that what we want?
FieldTimeSeries
represents a time-series of Field
s which can be in or out-of-core. For post-processing, the key power of FieldTimeSeries
is that one can use all of the abstract operations, as well as plotting utilities, that are defined for Field
. FieldTimeSeries
have grids, boundary conditions, indices, and staggered locations, all of which are not within scope for DimensionalData
. But more than that, FieldTimeSeries
also supports time-interpolation and implements an interface for different data backends -- thereby also supporting forcing models with discrete data from diverse sources, like reanalysis, other model output, etc.
We use FieldTimeSeries
to represent lazily-loaded atmospheric states for forcing regional and global simulations in ClimaOcean. We also use FieldTimeSeries
to represent other types of data loaded from file, like the ECCO state estimate, for restoring. We will use FieldTimeSeries
for open boundary conditions in the future.
It's not really important whether the "majority of users" interface with NetCDF (even if that were true, which I doubt). What matters more is our vision for next-generation ocean modeling software. My argument is that we can build something much more productive and powerful if we recognize that the problems of post-processing and online diagnostics are in fact, exactly the same. The typical approach is to build different systems for those two things. If we build one system, we'll have a lot more with less code and less work.
I made the choice to stick with NetCDF because that's how the majority of users (based on the people I know at least) use it to process Oceananigans output and the Oceananigans examples are already all written with JLD2 output (with I think two exceptions only).
I think the priority should be to use FieldTimeSeries
-- because it's so important --- which also makes it prerogative to support NetCDF with FieldTimeSeries
in the even that using NetCDF is important for some reason.
That said, I don't think the output format is very crucial if we are working within Julia. NetCDF has the advantage of being useful outside of Julia, but this is moot for a Julia package (like Oceanostics). On the other hand JLD2 is more lightweight and portable (there are systems that don't even support NetCDF), so JLD2 is absolutely crucial.
FieldTimeSeries
does not overlap withDimensionalData
.DimArray
is a lower-level object than aFieldTimeSeries
, and in fact we can envision subtypingFieldTimeSeries
/Field
as anAbstractDimArray
. This has the potential to supply array slicing utilities directly inField
andFieldTimeSeries
. Then theDimensionalData
syntax could be used both for building output writers (eg theindices
kwarg in output writers) as well as for post-processing.
FieldTimeSeries
represents a time-series ofField
s which can be in or out-of-core. For post-processing, the key power ofFieldTimeSeries
is that one can use all of the abstract operations, as well as plotting utilities, that are defined forField
.FieldTimeSeries
have grids, boundary conditions, indices, and staggered locations, all of which are not within scope forDimensionalData
. But more than that,FieldTimeSeries
covers any work with discrete time-series --- including forcing models with discrete data. We useFieldTimeSeries
to represent lazily-loaded atmospheric states for forcing regional and global simulations in ClimaOcean. We also useFieldTimeSeries
to represent other types of data loaded from file, like the ECCO state estimate, for restoring. We will useFieldTimeSeries
for open boundary conditions in the future.
Gotcha, I wasn't aware of that distinction. Thanks for the clarification. I do like the idea of subtyping AbstractDimArray
given some of the capabilities of the latter.
It's not really important whether the "majority of users" interface with NetCDF (even if that were true, which I doubt).
I disagree. I think people come to the examples section of the docs as a starting point to write their own scripts. So providing them with something they're more likely to use is helpful. I don't oppose using JLD2 and NetCDF in different examples though. I just don't think we should switch all examples completely.
What matters more is our vision for next-generation ocean modeling software. My argument is that we can build something much more productive and powerful if we recognize that the problems of post-processing and online diagnostics are in fact, exactly the same. The typical approach is to build different systems for those two things. If we build one system, we'll have a lot more with less code and less work.
I agree with that too. The issue in my view is that xarray has a huge head start in solving these issues, so people are more likely to use it.
I disagree. I think people come to the examples section of the docs as a starting point to write their own scripts. So providing them with something they're more likely to use is helpful.
But following this logic, we could never implement / demonstrate a new feature, because nobody would use it yet.
I think we need to document the way we want things to be done. If users persist in doing things differently, we need to figure out why and make sure we work on features that are useful.
The issue in my view is that xarray has a huge head start in solving these issues, so people are more likely to use it.
Why is this an issue? I think in principle we can bring xarray features into FieldTimeSeries
(this would add a huge dependency but is possible in principle). It's not correct to think of FieldTimeSeries
as replacing or supplanting xarray. They are orthogonal.
I'm also a huge proponent for NetCDF for multiple reasons (even for fully Julian workflows) so I'm interested in making FieldTimeSeries
work with NetCDF output.
I'd also love some of the niceties of xarray, DimensionalData.jl, and RasterStack
. I do like the idea of opening issues for features we'd like.
Maybe we can work together on this, and resolve this issue in the process?
https://github.com/CliMA/Oceananigans.jl/pull/2652 may be the next step. Then I was thinking of opening a PR to modernize or clean up NetCDFOutputWriter
(and make it more flexible). Then a PR to make it work with FieldTimeSeries
.
I think NetCDF has important applications, despite some downsides like worse performance and reduce portability. The most important deficiency is that calculations from the simulation cannot be reproduced when saving in NetCDF because Field cannot be reconstructed. I think this is a major risk of saving in NetCDF; even though it may be convenient to analyze with xarray, the price of ruling out precise analysis has a major scientific cost. It's a lot to pay for convenience.
Performance and portability have to be solved elsewhere but we can do something about this reproducibility problem. We need to make it easier to rebuild native Oceananigans types from NetCDF data. This could be part of an effort to support FieldTimeSeries
. I agree that the ability to reconstruct the Oceananigans grid is probably the core issue. There are some auxiliary ones too, such as saving location, indices.
As for DimensionalData / RasterStack, they seem convenient so it would be nice to leverage them. It seems like the right way to do this is to make AbstractField
subtype AbstractDimArray
, rather than merely AbstractArray
as it currently does.
Maybe we can work together on this, and resolve this issue in the process?
CliMA/Oceananigans.jl#2652 may be the next step. Then I was thinking of opening a PR to modernize or clean up
NetCDFOutputWriter
(and make it more flexible). Then a PR to make it work withFieldTimeSeries
.
I like this plan. Although https://github.com/CliMA/Oceananigans.jl/pull/2652 is old enough that it might be preferable to start a new PR. And if that's true it might be preferable to start with modernizing the NetCDFWriter
. I'll start by checking https://github.com/CliMA/Oceananigans.jl/pull/2652 and then we can discuss.
I agree that the ability to reconstruct the Oceananigans grid is probably the core issue. There are some auxiliary ones too, such as saving location, indices.
Yeah I don't think it'll be pretty, but we should be able to save enough metadata in the NetCDF file to do this.
It seems like the right way to do this is to make
AbstractField
subtypeAbstractDimArray
, rather than merelyAbstractArray
as it currently does.
I know this came up in the past but it sounded like it would be too much work and refactoring? I could be wrong. If we just want a few features then it could be easier to implement just those?
Although https://github.com/CliMA/Oceananigans.jl/pull/2652 is old enough that it might be preferable to start a new PR. And if that's true it might be preferable to start with modernizing the
NetCDFWriter
.
Sounds good! I don't know if we should open a new issue to discuss but yeah I do have my own list of NetCDFOutputWriter
grips haha.
I know this came up in the past but it sounded like it would be too much work and refactoring? I could be wrong. If we just want a few features then it could be easier to implement just those?
I can't remember. But of course, there is always a trade-off between cost/work and benefit gained. If people are willing to absorb the huge penalty of going to xarray for analysis in order to get slicing then to me that suggests there is a pretty significant benefit. That assumes that the penalties are understood and rational decisions are made of course...
I think it's a bit nicer to use our built-in solution, since it gels with plotting and enables working with
AbstractOperations
in post-processing:https://github.com/tomchor/Oceanostics.jl/blob/f7573d9b0c0c6ce1eeee95fd1482a32c2e569f30/docs/examples/two_dimensional_turbulence.jl#L105-L107
Also nice to be diligent about using it in docs, which encourages users to help contribute to developing its features.