remicousin opened 4 months ago
And that was run on storm.
And... in the case of applying a `where` on the daily data, I also got a Performance warning.
Precomputing the resampling certainly seems like a reasonable thing to do in a situation like this. The downside is having another derived dataset to maintain (more transformation steps that can break or that we can forget to run, more places to look when we're tracking down an error). If we don't have a use for the daily data, you could merge the resampling step into the original zarrification script, so there's only one level of derivation instead of two.
If you're certain that seasons will always be three months long, precomputing seasonal values is fine, but a more flexible alternative would be to precompute monthly values and calculate seasonal from monthly on the fly.
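For instance, computing a season from precomputed monthly means could look like the sketch below (assuming an xarray DataArray of monthly means on a datetime coordinate `T`; the function name and signature are illustrative, and weighting by days-in-month is one way to keep the seasonal mean honest):

```python
import xarray as xr

def season_from_monthly(monthly: xr.DataArray, months=(2, 3, 4)):
    """Seasonal mean from monthly means, weighted by days in month.

    Assumes the season does not span Jan 1; a spanning season (e.g. DJF)
    would need a shifted year label.
    """
    sel = monthly.sel(T=monthly["T"].dt.month.isin(list(months)))
    weights = sel["T"].dt.days_in_month
    total = (sel * weights).groupby(sel["T"].dt.year).sum("T")
    return total / weights.groupby(sel["T"].dt.year).sum("T")
```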
Right... as always... we are driven by projects that don't give us the time to make the proper building blocks. What you are describing, to me, is basically the beginning of data cataloguing for a library of zarr stores... This should be high up in our priority tasks...
It turns out that using rolling over monthly data still takes over 3 min (on storm) to do the work for 30 years. So I resorted to cutting the monthly object down to the months of interest and then dealing with that; this basically meant applying the strategy used in enacts/calc.py for seasonal calculations on daily data. It's now 30 s to get 30 years of a season for all X/Y. For time series, it's probably going to be fine since it will do it for just one X/Y. For maps, it might still be borderline... We have 2 of them to compute to make the map (anomalies), so it would be 1 min... But if there is parallelism by tiles, that might be fine? If not, we can precompute the historical one since it's going to be the same for all analyses, so that would be back to 30 s, to be parallelized.
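To make the contrast concrete (an illustrative sketch; `monthly` stands for a dask-backed DataArray of monthly means with datetime coordinate `T`, FMA as the example season):

```python
# Strategy 1: rolling over the entire monthly record, then picking the
# season -- every month is computed and most of the result is discarded.
seas_all = monthly.rolling(T=3, center=True).mean()
fma_slow = seas_all.sel(T=seas_all["T"].dt.month == 3)  # FMA centered on March

# Strategy 2 (the enacts/calc.py-style approach): cut the object down to
# the season's months first, then reduce each year's three months.
fma_only = monthly.sel(T=monthly["T"].dt.month.isin([2, 3, 4]))
fma_fast = fma_only.groupby(fma_only["T"].dt.year).mean("T")
```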
I'll add proper functions description once approved.
How do I run this?
With the enactsmaproom environment, then, e.g.:

```python
data = read_data("historical", "GFDL-ESM4", "pr")
seas_data = seasonal_data(data, 2, 4, start_year="1981", end_year="2010")
```
We're going to reuse the season selection logic everywhere. Can we extract it into pingrid, and have enacts and pepsico share it?
Once extracted, it should have automated tests to demonstrate that it works as intended in various situations, e.g. when the season spans Jan 1.
Ideally it would handle (sub-)seasons that aren't necessarily composed of whole months, but that's probably too much to ask right now.
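For the record, a sketch of the kind of test I have in mind (pytest style; `select_season` is a hypothetical stand-in for whatever helper ends up in pingrid):

```python
import numpy as np
import pandas as pd
import xarray as xr

from pingrid import select_season  # hypothetical helper, not the current API

def test_season_spanning_jan_1():
    time = pd.date_range("2000-01-01", "2002-12-01", freq="MS")
    da = xr.DataArray(np.arange(len(time)), coords={"T": time}, dims="T")
    # DJF spans Jan 1: Dec 2000 must land in the same season instance as
    # Jan/Feb 2001, not with Jan/Feb 2000.
    djf = select_season(da, start_month=12, end_month=2)
    assert set(djf["T"].dt.month.values) == {12, 1, 2}
```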
This code puts the "end edge" of the FMA season at midnight on April 1, i.e. all of April is excluded. I guess that works ok as long as all the data are stamped at midnight on the first of the month, but it's likely to cause errors down the road, e.g. if someone calculates the season length as `season_end - season_start`. I'm not sure what the right answer is. If we make it April 30 at 23:59:59, there's still one second missing. Mathematically, I think the thing to do would be to say that time intervals are closed at the start and open at the end, and use midnight on May 1 as the end of the FMA season. But that might make the bounds difficult to use with xarray, because xarray slicing is always closed on both ends. An alternative might be to define seasons by a start point and a length, leaving the end implicit. But that might be hard to use?
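To make the xarray issue concrete (monthly data stamped at month starts, hypothetical FMA bounds):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2000-12-01", freq="MS")
da = xr.DataArray(np.arange(12), coords={"T": time}, dims="T")

# Closed-start / open-end FMA bounds: [Feb 1, May 1)
start, end = np.datetime64("2000-02-01"), np.datetime64("2000-05-01")

# slice() is closed on both ends, so it wrongly pulls in the May 1 stamp:
wrong = da.sel(T=slice(start, end))                    # Feb, Mar, Apr, May
# A boolean mask honors the open end:
fma = da.sel(T=(da["T"] >= start) & (da["T"] < end))   # Feb, Mar, Apr
```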
Which logic to apply everywhere is pending more work on how we want to operate on time more generally. The last time we worked on that, which dates back a bit, you were more inclined to opt for a splitstreamgrid philosophy, so that the same "seasonal" analysis can be applied in parallel to multiple "years," and that would probably mean relying much less on groupby and labeling.
This is heavily inspired by the version applied to daily data, and one could imagine that there is a bunch to factor out into a common module, to get some master seasonal function that works for all sorts of time resolutions of the input, whether the months are whole or not. So yes, a bigger project too.
There are some tests for the enacts/calc.py version.
I see the problem about `seasons_ends`... that sounds like another argument for center + width... But in any case, per the above, we are far from the generalization case... so I am not sure what to do in the immediate term...
More generally speaking, there will be a bunch, other than the calculation, to make common to enacts and pepsico (like a zarrification library), so there is a bigger project to clean all that up.
A consideration in favor of bounds rather than midpoint (or start) + width is that CF supports cell bounds but not widths. That means we should (eventually) have the ability to write netcdf or zarr files with bounds; it doesn't necessarily mean our calculation code needs to use that representation internally, but it would be simpler (in some ways) if it did.
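For concreteness, CF-style bounds in an xarray dataset would look something like this (FMA example; names and values illustrative):

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"pr": ("T", [1.0, 2.0])},
    coords={"T": pd.to_datetime(["2000-03-16", "2001-03-16"])},  # midpoints
)
# Companion (T, 2) bounds variable, referenced by the coordinate's
# "bounds" attribute per CF; here closed-start / open-end.
ds["T_bnds"] = (
    ("T", "bnds"),
    [pd.to_datetime(["2000-02-01", "2000-05-01"]),
     pd.to_datetime(["2001-02-01", "2001-05-01"])],
)
ds["T"].attrs["bounds"] = "T_bnds"
```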
If we did go with a width representation, we could either express the width in days (in which case width would be an auxiliary coordinate, not a constant attribute), or use pandas.DateOffset, which does support months.
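E.g., to illustrate the difference between the two representations:

```python
import pandas as pd

start = pd.Timestamp("2000-02-01")
width = pd.DateOffset(months=3)      # constant for any 3-month season
end = start + width                  # Timestamp('2000-05-01'), open end
width_in_days = (end - start).days   # 90 here; varies with season and year
```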
The more time elapses before you do that, the harder it will be to remember how this code works. But I understand that Sheen's departure puts time pressure on this.
I think I've contributed what I can. Anything you don't have time to address now can be addressed later... but hopefully not too much later.
Right, the bounds/width/center is a similar problem to that start/lead/target one: you need only 2 of the 3 elements to describe the system fully, the relationship to switch representations is well known, and there are applications or practices that work smoother with one or the other.
I am still hopeful to work on the enactstozarr changes to make it part of the dlupdate script before Nigeria, and that may be an opportunity to rationalize the enacts/pepsico commonalities on that matter. The rest will likely have to wait for October... (Nigeria will eat up August and CASA September).
Ingrid can return bounds from its grids, btw: e.g.
That's a link to the CF docs. Did you mean to give an Ingrid link?
There also seems to be the concept of Periods (not periods) in pandas, but I haven't gotten as far as I wished in reading about it. It looks like it's more a labeling concept than a mathematical one.
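For what it's worth, a quick sketch suggests Periods do carry start/end semantics; with a quarterly frequency anchored to April, FMA is even a native period (how far this generalizes to arbitrary seasons, I'm not sure):

```python
import pandas as pd

fma = pd.Period("2000Q4", freq="Q-APR")  # fiscal year ends in April, so Q4 = Feb-Apr
fma.start_time  # Timestamp('2000-02-01 00:00:00')
fma.end_time    # Timestamp('2000-04-30 23:59:59.999999999'), i.e. the
                # "last instant" convention rather than an open end
```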
What's the status of this PR? There's some useful discussion here that perhaps we should preserve elsewhere.
I need to finish some form of this app by the end of the calendar year. Work stalled because of CASA and vacations. I will return to it in earnest when I'm back. We can discuss then what to do for the app itself, and what to preserve elsewhere for later (in Issues, I would think).
ok. I rebased so that I can get the seasonal calculations, start replacing the map placeholder with an actual result of interest (and make the time series graph), add appropriate controls, and have a sharable beta version.
I figure this is what this PR should try to achieve, as I need some initial deliverable this calendar year. However, there is more to it mentioned in this PR. It's all up there but in summary: there is a long-term issue about general time representation and its consequences on libraries to reduce daily/monthly data to yearly seasonal data; and a mid-term issue of factoring out the common content between pepsico and enacts.
While I need some initial output for Pepsico to report on this CY, I have through end of March 2025 to complete. I have some 1.5 mo of effort from them for that. I could use CC effort to complement and work on the time/factorization issues (I don't have anything but CC starting Jan -- plus likely help from Aaron with ENACTS Maprooms from Tufa's Ethiopia project). So in light of this I would suggest (in rather chronological order):
yes?
This is basically what we need for the Pepsico App for this work.
It takes over 4 min to print seas (and less than 2 if applying a `where` on the daily data -- commented -- rather than on the seasonal data). Knowing that for the map, we'll need to compute 2 such seas (to make the difference between a projection and a historical reference). For the timeseries, we'll need to apply it to 4 of them, but only at one spatial point.
We're doing nothing under the time resolution of the 3-month-long season... so shall we write another zarr store for all the variables that would basically be the result of `.resample(T="1M").mean().rolling(T=3, center=True).mean().dropna("T")`?
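Spelled out for one variable (`daily` standing for a dask-backed daily DataArray with time coordinate `T`; the output path is illustrative):

```python
seasonal = (
    daily.resample(T="1M").mean()           # monthly means
         .rolling(T=3, center=True).mean()  # 3-month running mean
         .dropna("T")                       # edge months lack a full window
)
seasonal.to_dataset(name="pr").to_zarr("pr_seasonal.zarr")
```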
Or is there another option to think of? (@aaron-kaplan )