spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

Support sub-monthly time reduction intervals #204

Open chuaxr opened 7 years ago

chuaxr commented 7 years ago

By default, the output file from an aospy calculation is named with "...start_year-end_year.nc" or simply "...start_year.nc". This means that using date ranges within the same year (e.g. 5 Aug-10 Aug, 10 Aug-15 Aug) will result in only one .nc file instead of two.

As a workaround, I currently have the following changes in _file_name in calc.py:

    def _file_name(self, dtype_out_time, extension='nc'):
        """Create the name of the aospy file."""
        out_lbl = utils.io.data_out_label(self.intvl_out, dtype_out_time,
                                          dtype_vert=self.dtype_out_vert)
        in_lbl = utils.io.data_in_label(self.intvl_in, self.dtype_in_time,
                                        self.dtype_in_vert)
        ens_lbl = utils.io.ens_label(self.ens_mem)
        # Changed: label files with full YYYYMMDD start and end dates rather
        # than years only, so sub-monthly ranges get distinct file names.
        ymd_start_lbl = self.start_date.strftime('%Y%m%d')
        ymd_end_lbl = self.end_date.strftime('%Y%m%d')
        return '.'.join(
            [self.name, out_lbl, in_lbl, self.model.name,
             self.run.name, ens_lbl, ymd_start_lbl, ymd_end_lbl, extension]
        ).replace('..', '.')

Presumably this level of precision would not be necessary for users averaging over years of monthly averaged data, so perhaps this should be an option for users to turn on/off.
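For illustration, a minimal sketch (the helper name is hypothetical, not aospy code) of how the `%Y%m%d` labels keep two sub-monthly ranges in the same month distinct:

```python
import datetime

def date_labels(start_date, end_date):
    """Build start/end filename labels at daily precision (sketch)."""
    return start_date.strftime('%Y%m%d'), end_date.strftime('%Y%m%d')

# Two sub-monthly ranges in the same year and month now get distinct labels,
# so neither output file overwrites the other.
first = date_labels(datetime.datetime(2001, 8, 5), datetime.datetime(2001, 8, 10))
second = date_labels(datetime.datetime(2001, 8, 10), datetime.datetime(2001, 8, 15))
print(first)   # ('20010805', '20010810')
print(second)  # ('20010810', '20010815')
```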

spencerahill commented 7 years ago

Thanks for the report. We should definitely support writing to sub-monthly periods. (You can tell from this example that we wrote aospy when our research was focused on longer timescales! 😝 )

This means that using date ranges within the same year (e.g. 5 Aug-10 Aug, 10 Aug-15 Aug) will result in only one .nc file instead of two.

Strictly speaking, this only occurs if they are within the same year and month -- e.g. 5 Aug-10 Aug would not be overwritten by 10 Sept-15 Sept.

Presumably this level of precision would not be necessary for users averaging over years of monthly averaged data, so perhaps this should be an option for users to turn on/off.

We can make it adaptive so that it includes the minimum needed precision but no more, without the user having to specify anything. We already do this for the year, as you noted: YYYY if the dates span only a single year, rather than YYYY-YYYY.
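A sketch of that adaptive year labeling (function name hypothetical; the real logic lives in `utils.io.yr_label`):

```python
def yr_label(start_year, end_year):
    """Label with minimum needed precision: 'YYYY' for a single year,
    'YYYY-YYYY' for a multi-year span (sketch)."""
    if start_year == end_year:
        return '{:04d}'.format(start_year)
    return '{:04d}-{:04d}'.format(start_year, end_year)

print(yr_label(3, 3))  # '0003'
print(yr_label(3, 6))  # '0003-0006'
```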

The code snippet you provided is a nice starting point. We'll need to iterate, though, on what's the best way to represent more general dates. It occurs to me that our current method, with letters for annual means or multi-month seasons (e.g. 'ann', 'djf', 'jas') but numbers for single months ('01' for January, etc.), separated from the year range by a '.', doesn't extend easily to shorter timescales. @spencerkclark has been thinking a lot in the past year about date/time representations, so I'll let him chime in before proceeding.

In the meantime, we should add a Note or Warning in the docs about this (or maybe a whole section/sub-section about using aospy with higher frequency data).


Also, there are likely other places where we have implicitly assumed a monthly or longer period for everything, so please keep reporting anything else along those lines. Thanks!

spencerkclark commented 6 years ago

Thanks for taking the time to write up this issue @chuaxr! @spencerahill it seems we have differing views on this :) Maybe I'm missing something important (examples illustrating your concerns might help), but I don't see any major issues with @chuaxr's suggestion.

Strictly speaking, this only occurs if they are within the same year and month -- e.g. 5 Aug-10 Aug would not be overwritten by 10 Sept-15 Sept.

If I understand the situation properly, I think @chuaxr was correct initially. I think the only way the start and end date for each calculation are currently encoded in file names is by the year of each. E.g. a calculation with a start date of 0003-01-01 and an end date of 0006-12-31 would have a file name of something like: var_name.ann.av.from_monthly_ts_sigma.model_name.run_name.0003-0006.nc. Therefore if one wanted to take the average of var_name over the period 0003-12-01 to 0006-03-31 as well, the original file would be overwritten (which is not desired).

We can make it adaptive so that it includes the minimum needed precision but no more, without the user having to specify anything. We already do this for the year, as you noted: YYYY if the dates span only a single year, rather than YYYY-YYYY.

We could think of making the precision adaptive; pandas has some notion of this, e.g.:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: u'0.20.3'

In [3]: pd._libs.tslib.parse_datetime_string_with_reso('2000')
Out[3]:
(datetime.datetime(2000, 1, 1, 0, 0),
 datetime.datetime(2000, 1, 1, 0, 0),
 'year')

In [4]: pd._libs.tslib.parse_datetime_string_with_reso('2000-01')
Out[4]:
(datetime.datetime(2000, 1, 1, 0, 0),
 datetime.datetime(2000, 1, 1, 0, 0),
 'month')

In [5]: pd._libs.tslib.parse_datetime_string_with_reso('2000-01-01')
Out[5]:
(datetime.datetime(2000, 1, 1, 0, 0),
 datetime.datetime(2000, 1, 1, 0, 0),
 'day')

But it is sort of hard to reverse engineer that from full-resolution datetimes without some assumptions (and the logic could get messy), which is what we would need to do at the moment. (In pandas the resolution is determined by how many digits are provided in the string-specification; right now in aospy we start straight from datetimes). I'm not totally sure this would be worth the trouble/code complexity.
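To illustrate the messiness, here is a hedged sketch (purely illustrative, not aospy code) of what reverse-engineering a resolution from full datetimes might look like, and the assumptions it forces:

```python
import datetime

def infer_resolution(date):
    """Guess the coarsest resolution consistent with a datetime (sketch).

    Assumes e.g. that 2000-01-01 00:00 was 'meant' as a year -- exactly the
    kind of assumption that makes this approach fragile, since a user may
    genuinely mean midnight on New Year's Day.
    """
    if (date.month, date.day, date.hour, date.minute) == (1, 1, 0, 0):
        return 'year'
    if (date.day, date.hour, date.minute) == (1, 0, 0):
        return 'month'
    if (date.hour, date.minute) == (0, 0):
        return 'day'
    return 'sub-daily'

print(infer_resolution(datetime.datetime(2000, 1, 1)))   # 'year'
print(infer_resolution(datetime.datetime(2000, 3, 1)))   # 'month'
print(infer_resolution(datetime.datetime(2000, 3, 15)))  # 'day'
```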

The code snippet you provided is a nice starting point. We'll need to iterate, though, on what's the best way to represent more general dates. It occurs to me that our current method, with letters for annual means or multi-month seasons (e.g. 'ann', 'djf', 'jas') but numbers for single months ('01' for January, etc.), separated from the year range by a '.', doesn't extend easily to shorter timescales.

I don't see a huge issue regarding conflict between the intvl_out label and the integer month label (but I could be convinced otherwise given an example where it might be a problem). To me the start and end date are sufficiently far from the intvl_out label in the file names that I think it is clear which is which (e.g. the distance between ann and 0003-0006 in my example above). If we think this might be a problem, another option we could consider would be altering the way we encode single-month intvl_out specifications (e.g. maybe using the full month name rather than a zero-padded integer).

For those reasons I feel I would not be opposed to extending things out to daily resolution in the filenames for the start and end dates (in all circumstances to keep the logic simple). So my example above would look something like: var_name.ann.av.from_monthly_ts_sigma.model_name.run_name.0003-01-01.0006-12-31.nc. Here I'm sticking with ISO 8601 format for the dates, since that seems to be the most common throughout Python.
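As a sketch of the proposed naming (helper name hypothetical; modern years used here only to sidestep platform-dependent `strftime` zero-padding of very small years):

```python
import datetime

def date_range_label(start_date, end_date):
    """ISO 8601 start/end labels at daily resolution in all cases (sketch)."""
    return '{}.{}'.format(start_date.strftime('%Y-%m-%d'),
                          end_date.strftime('%Y-%m-%d'))

label = date_range_label(datetime.datetime(2003, 1, 1),
                         datetime.datetime(2006, 12, 31))
print(label)  # '2003-01-01.2006-12-31'
```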

chuaxr commented 6 years ago

While I don't think I need it right now, it's not inconceivable that a user might need hourly (or even higher) resolution some day (e.g. tracking the formation of a particular storm event). I would therefore support a user-input time-format string.

I was able to pass a time_format_str argument (e.g. '%Y%m') from calc_suite_specs in aospy_main.py by adding time_format_str to _AUX_SPEC_NAMES and _NAMES_SUITE_TO_CALC in automate.py and CalcInterface in calc.py. Maybe this is something you'd like to implement?

spencerahill commented 6 years ago

Thanks both for your thoughts. I realize now I'm confused about the use case. @chuaxr, can you clarify:

  1. In words, are you trying to average over e.g. Aug 5-15 for one year only? OR averaging over Aug 5-15 over multiple years?
  2. In code, can you provide the precise values of output_time_intervals and date_ranges in calc_suite_specs?

If it's for a single year, then that's where our logic gets a bit odd: if the date range is exactly equal to Aug 5-15, then either 'ann' or 8 as the output_time_intervals would give identical results, because 'ann' keeps all data, and 8 keeps all August data, which are identical here. But using 8 is preferable, because then you can at least use 9 for September dates, and so on. This is what I meant by there being up to one file per month.

Apologies if I'm misunderstanding still. I need us to sort these things out before being able to think clearly about the rest of @spencerkclark's thoughts.

chuaxr commented 6 years ago

@spencerahill I'm only averaging within one year. Here's a concrete example that triggers the overwriting:

date_ranges = [(datetime.datetime(2001, 8, 19, 1), datetime.datetime(2001, 8, 29, 0)),
               (datetime.datetime(2001, 8, 9, 1), datetime.datetime(2001, 8, 19, 0))]

output_time_intervals = ['ann'],

spencerkclark commented 6 years ago

@chuaxr just to give a little more background to your use-case, there's also no seasonal cycle in insolation in your model, correct? (So years or months don't hold any specific significance?)

chuaxr commented 6 years ago

Yup, the specific dates are only chosen for consistency with the arbitrary dates I used in the wrf run, and have no physical significance otherwise.


spencerahill commented 6 years ago

Thanks @chuaxr, that's helpful.

Now that we have our bearings, I think the issue is ultimately just that we don't yet support sub-monthly output_time_interval values. This wouldn't be an issue for you if (something like) the following worked:

calc_suite_specs = dict(output_time_intervals=[('08-09', '08-19'), ('08-19', '08-29')], ...)

where each tuple corresponds to one of your desired averaging periods. Note that this would work whether your data spanned one year or less or multiple years; the number of years averaged over would be specified via date_ranges.

The resulting filename would be var_name.0809-0819.av.from_monthly_ts_sigma.model_name.run_name.YYYY.nc, where YYYY is the year (or YYYY-YYYY for multiple years), again modulo the precise label we choose to represent the sub-monthly ranges.

@spencerkclark our thoughts on this seem to have diverged. What is your take? I just keep coming back to feeling that we shouldn't recommend using ann in this case.

spencerkclark commented 6 years ago

@spencerahill I see your point regarding 'ann'. Thinking about this more, fundamentally what makes me uncomfortable about sticking with 'ann' or your suggested option, custom within-year intervals, is that it doesn't really solve the underlying issue. I think the underlying issue is that aospy does not truly support taking the time average over an arbitrary sequence of consecutive times.

Consider the edge case that @chuaxr might be interested in taking the mean over days 360 to 380 of a simulation. The first six days would be in year one, while the last fifteen would be in year two. If we went with either of the solutions proposed (@chuaxr's or yours), how might we support that? Sticking with 'ann' would not work, because aospy always groups within years to generate an annual time series, and then takes an unweighted mean across years (so because there are fewer days in year one than in year two, days in year one are weighted more heavily than they should be). I think using custom sub-monthly averaging periods is also ill-defined for the same reason (aospy would try to group things into individual years and then average).
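To make the weighting concern concrete, here is a small numeric sketch (toy data; a 360-day calendar assumed purely for illustration) contrasting the group-by-year-then-average approach with a direct time-weighted mean:

```python
import numpy as np

# Toy daily data for days 360-380 of a simulation: 6 days fall in year one,
# 15 days in year two (21 days total).
year_one = np.full(6, 1.0)    # suppose the field averages 1.0 in year one
year_two = np.full(15, 2.0)   # and 2.0 in year two

# Group within years, then take an unweighted mean across years
# (the current 'ann' pathway):
per_year_means = [year_one.mean(), year_two.mean()]
grouped = np.mean(per_year_means)  # (1.0 + 2.0) / 2 = 1.5

# Direct time-weighted mean over all 21 days:
weighted = np.concatenate([year_one, year_two]).mean()  # (6*1 + 15*2) / 21

print(grouped)   # 1.5: year-one days weighted more heavily than they should be
print(weighted)  # ~1.714
```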

Does that concern make sense? This would probably be best supported by a new time reduction pathway (to add to 'av', 'std', etc.). The operation would be to just take the time-weighted average over the times that fall between the start and end dates specified in the main script. (We'd have to think about naming; the most natural choice would be 'av', but that's currently taken.)

spencerahill commented 6 years ago

I totally agree. Our pipeline inherently revolves around taking averages relative to calendar years: first within them, then across them. There's no way at present to support an average of a date range that spans across calendar years, and I think you're right that a new time reduction is likely needed.

Unfortunately, my gut tells me that will be a big task. Also, that seems to me a different issue than the one @chuaxr is currently facing, that of a sub-monthly period that doesn't cross calendar years. For those reasons, I'm inclined to focus on the sub-monthly issue first.

I'll open an issue to track the across-year problem, and we can turn in this thread to the sub-monthly support.