openscm / scmdata

Handling of Simple Climate Model data
https://scmdata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

With pandas 2.0, custom time handling might not be necessary any more #242

Open mikapfl opened 1 year ago

mikapfl commented 1 year ago

Describe the bug

Most of the time-handling functionality in scmdata duplicates functionality that already exists in pandas. Given that pandas is a much more mature project, the scmdata implementation also lags well behind it.

An example is the resampling functionality of pandas. The code for resampling is super concise, and both interpolation and extrapolation are available. The equivalent of pandas'

ts.resample("5Min").sum()

in scmdata will be much more involved, simply because run.interpolate takes the target_times as a list of time values.

As a result, I found it much easier to convert ScmRun objects into pandas DataFrames with a proper pd.DatetimeIndex when working with even the simplest time-related problems. The question then is: why is there a custom re-implementation of pandas' functionality in scmdata at all? Wouldn't it be better to get rid of it and leverage pandas' functionality directly?
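To make the comparison concrete, here is a rough sketch. The pandas half is real code; the `run.interpolate` call is only indicative of the shape of the scmdata API described above, not an exact signature.

```python
import datetime

import pandas as pd

# pandas: the target grid is implied by the frequency string
ts = pd.Series(
    [1.0, 3.0, 2.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-10", "2000-01-20"]),
)
resampled = ts.resample("5D").interpolate()

# scmdata-style: the caller has to spell out every target time explicitly
# (indicative only; `run` would be an ScmRun holding the same data)
target_times = [
    datetime.datetime(2000, 1, 1) + datetime.timedelta(days=5 * i)
    for i in range(4)
]
# interpolated = run.interpolate(target_times)
```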

znicholls commented 1 year ago

I think the major issue was that pandas' supported datetime range was limited to roughly 1760 to 2250 or something, and we needed to be able to handle datetimes outside those dates. Now that pandas supports microsecond precision (I think), we could revisit those restrictions. We have tests for handling datetimes post-2250 and pre-1760, so if changing the back-end leaves the tests passing then I think we're good.

mikapfl commented 1 year ago

I looked again and, as far as I can tell, pandas still just breaks on dates outside this range:

In [12]: pd.Timestamp.min
Out[12]: Timestamp('1677-09-21 00:12:43.145224193')

In [13]: pd.Timestamp.max
Out[13]: Timestamp('2262-04-11 23:47:16.854775807')

There is a workaround using periods (e.g. pd.period_range) instead, but it loses functionality and also doesn't work for dates before 1 AD.

Now I also remember that this was one reason to prefer xarray over pandas when we made the choice for primap2. Okay, not much to do then; I guess any library building on pandas which needs long-range dates will have to roll its own date handling.

znicholls commented 1 year ago

Hmm, sad. I thought there was some pandas update, but it looks like https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution doesn't do the job.

I thought we tried to use as much of xarray's cftime handling as possible; we probably need to revisit that...


mikapfl commented 1 year ago

Ah, I hadn't seen the new functionality from April. Yes, I think with pandas 2.0, e.g. date_range learned the necessary tricks:

In [8]: pd.date_range(start="-1000-01-01", end="3000-01-01", freq="AS", unit="s")
Out[8]: 
DatetimeIndex(['-1000-01-01',  '-999-01-01',  '-998-01-01',  '-997-01-01',
                '-996-01-01',  '-995-01-01',  '-994-01-01',  '-993-01-01',
                '-992-01-01',  '-991-01-01',
               ...
                '2991-01-01',  '2992-01-01',  '2993-01-01',  '2994-01-01',
                '2995-01-01',  '2996-01-01',  '2997-01-01',  '2998-01-01',
                '2999-01-01',  '3000-01-01'],
              dtype='datetime64[s]', length=4001, freq='AS-JAN')

This can be used as a full-featured index. So, I think this could be used in scmdata, reducing the amount of time-related code considerably.
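For instance, a small sketch (assuming pandas >= 2.0; the column name and values are made up) showing that the usual index machinery works on such an index:

```python
import numpy as np
import pandas as pd

# second-resolution index spanning far beyond the old nanosecond-only bounds
idx = pd.date_range(start="-1000-01-01", end="3000-01-01", freq="AS", unit="s")
df = pd.DataFrame({"value": np.linspace(0.0, 1.0, len(idx))}, index=idx)

# ordinary selection by index attributes works as usual
pre_industrial = df[df.index.year < 1750]
every_century = df[df.index.year % 100 == 0]
```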

znicholls commented 1 year ago

Nice find


znicholls commented 1 year ago

Any idea if it works with stuff that isn't a range (e.g. some historical data and then only 2100, 2200, 2300, 2400)?


mikapfl commented 1 year ago

Yes, DatetimeIndex supports any collection of Timestamps:

In [11]: (pd.date_range(start="1700-01-01", end="2021-01-01", freq="AS", unit="s")
    ...: .union(pd.date_range(start="2100-01-01", end="2400-01-01", periods=4, unit="s")))
Out[11]: 
DatetimeIndex(['1700-01-01', '1701-01-01', '1702-01-01', '1703-01-01',
               '1704-01-01', '1705-01-01', '1706-01-01', '1707-01-01',
               '1708-01-01', '1709-01-01',
               ...
               '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01',
               '2020-01-01', '2021-01-01', '2100-01-01', '2200-01-01',
               '2300-01-01', '2400-01-01'],
              dtype='datetime64[s]', length=326, freq=None)

Of course, you then lose frequency information on the whole thing because it doesn't have a regular frequency any more.
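A quick way to see the loss (sketch, pandas >= 2.0; printed values depend slightly on the pandas version):

```python
import pandas as pd

regular = pd.date_range(start="1700-01-01", end="2021-01-01", freq="AS", unit="s")
sparse = pd.date_range(start="2100-01-01", end="2400-01-01", periods=4, unit="s")

combined = regular.union(sparse)

print(regular.freq)   # an annual offset, set by date_range
print(combined.freq)  # None: the union is irregular, so no frequency survives
```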

znicholls commented 1 year ago

Nice


lewisjared commented 1 year ago

Another reason that fed into why we rolled our own was to be explicit about piecewise-linear (concentrations, ERF) and piecewise-constant (emissions) timeseries and their subtle differences. We never really achieved that goal, but have come back to discussing it every year or so.

While ScmRun.resample is handy for interpolating missing years, it doesn't support extrapolation directly.
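As a sketch of that distinction (illustrative values, not scmdata code): the two conventions give different answers for the same target times.

```python
import numpy as np

years = np.array([2010, 2015, 2020])
conc = np.array([400.0, 450.0, 600.0])  # e.g. concentrations: piecewise-linear
emis = np.array([450.0, 500.0, 600.0])  # e.g. emissions: piecewise-constant

targets = np.array([2012, 2017])

# piecewise-linear: draw straight lines between the reported points
conc_at_targets = np.interp(targets, years, conc)

# piecewise-constant: hold each reported value until the next time point
emis_at_targets = emis[np.searchsorted(years, targets, side="right") - 1]
```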

mikapfl commented 1 year ago

I think the difference between piecewise-linear and piecewise-constant boils down to extensive and intensive quantities, right? Any extensive quantity is piecewise constant and any intensive quantity is piecewise linear. If you resample, you have to take sums or do divisions for extensive quantities, and you calculate means or interpolate linearly for intensive quantities. In pandas, this is generalized with the concept of Resampler objects (akin to groupby): resampling just gives you a Resampler object, and you then call sum() or interpolate() on it.
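A minimal sketch of that pattern (made-up values, frequency aliases as of pandas 2.x):

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=4, freq="AS")

emissions = pd.Series([450.0, 500.0, 600.0, 650.0], index=idx)       # extensive
concentrations = pd.Series([410.0, 412.0, 415.0, 419.0], index=idx)  # intensive

# extensive quantity: aggregate to a coarser grid by summing each period
emissions_2yearly = emissions.resample("2AS").sum()

# intensive quantity: refine to a finer grid and interpolate linearly
concentrations_quarterly = concentrations.resample("QS").interpolate()
```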

I think for all kinds of higher-level operations, it would be great if we could specify whether a timeseries is extensive or intensive and then do the right thing automatically.

znicholls commented 1 year ago

I think we're on the same page but just to check:

Probably the simplest way to get on the same page would be to write some concrete tests..?

I think for all kind of higher-level operations, it would be great if we could specify for a timeseries if it is extensive or intensive and then do the right thing automatically

100%. If we can get users to tell us how to interpret between the discrete points, then we basically know the entire continuous timeseries, which makes it way easier to work with the data. We once tried to do this but then got caught in a rabbit hole and abandoned ship. Would be a great one to revive one day.

mikapfl commented 1 year ago

Hi,

I think I don't really understand the problem space yet:

Cheers,

Mika

znicholls commented 1 year ago

I think I don't really understand the problem space yet

That makes two of us :)

Some tests below which hopefully make things a bit more concrete

Two concrete tests:

```python
# units are a headache I haven't included at all below


def test_simple_interpolation():
    # we have some numbers where we just want the result to be basic
    # interpolation, e.g. concentrations.
    # let's pretend that all the time numbers below are seconds so we can
    # ignore the 'is it start of year or end of year' etc. headaches (which
    # are mostly dealt with in scmdata by using datetimes internally)
    in_times = [2010, 2015, 2020]
    in_dat = [400, 450, 600]
    out_times = [2010, 2013, 2017, 2020]
    exp_out_values = [
        400,  # same time point hence unchanged
        400 + 50 * (2013 - 2010) / (2015 - 2010),  # interpolated
        450 + 150 * (2017 - 2015) / (2020 - 2015),  # interpolated
        600,  # same time point hence unchanged
    ]


def test_piecewise_constant_integral():
    # now we have some numbers that are reported emissions. As far as I can
    # tell, the convention with reported emissions is for them to be 'the sum
    # over the timestep', e.g. if Australia reports its emissions as 450 MtCO2
    # in 2020, 500 MtCO2 in 2021 and 600 MtCO2 in 2022, then this means that
    # 450 MtCO2 was emitted in Australia over the year 2020, 500 MtCO2 in 2021
    # and 600 MtCO2 in 2022. As a result, from 2020 to 2022, 1550 MtCO2 was
    # emitted. The only way for this to be the case if we integrate is if we
    # assume that the data is piecewise-constant. If we put the data on a
    # [2020, 2021, 2022] grid and then assume that the data is
    # piecewise-linear, when we integrate we wouldn't get total emissions of
    # 1550 MtCO2, we would get total emissions greater than that (because
    # drawing a straight line between the points leads to higher emissions
    # than using a piecewise-constant assumption in this case where emissions
    # are rising).
    # (Having written this, in such a case all you really know is the
    # cumulative emissions at different points in time. To get back to
    # emissions, you have to make some assumption. If you interpolate linearly
    # in cumulative emissions space, you get discontinuities in emissions
    # space. If you do e.g. a cubic spline interpolation in cumulative
    # emissions space, you get smooth emissions but they can look pretty
    # weird. If you want some seasonal cycle in emissions space, that would
    # have some implication that I can't think through for how you interpolate
    # in cumulative emissions space.)
    # All of that mumbo jumbo hopefully explains the test below
    in_times = [2020, 2021, 2022]
    in_dat = [450, 500, 600]

    # somehow the integral calculation needs to know to treat the data as
    # piecewise constant when integrating. You can make it the user's problem
    # (which is our current solution because we have both a cumsum and an
    # integrate method on ScmRun), but I think we're trying to go a step
    # further than that here?
    integral = some_integral_calculation()
    assert integral == 1550

    # the next tricky thing with data like this is what to do when
    # resampling or interpolating.
    # If we sample up, we probably need a sum
    out_times = [2020, 2022]
    # Now our time steps are two years long, it is complicated to know what we
    # should put in as expected output
    exp = [
        # spread the cumulative emissions out over two years so the
        # piecewise-constant assumption continues to hold?
        (450 + 500) / 2,
        # What do we do for the 2022 time point? We don't have an end year
        # that specifies the end of the time step so it is hard to know what
        # we should put here, do we just assume constant or do we do some
        # fancy interpolation to try and get the data trend reflected too so
        # this should be slightly higher than 600?
        600,
    ]

    # When interpolating, there is similar ambiguity I think
    # (I think the same questions apply whether we're thinking about within a
    # year or between years, e.g. interpolating from 5-year time steps to
    # annual, but maybe I am missing something)
    out_times = [2020, 2020.5, 2021, 2021.5, 2022]
    # Do we just assume constant within the time step?
    exp = [
        450,
        450,
        500,
        500,
        600,  # similar problem with not having an end bound as above
    ]
    # Do we do some integral-preserving linear interpolation so it is
    # something like the below? (pseudo-values: "<450" means "a bit below 450")
    exp = [
        "<450",  # the trend is up so assume more of the emissions happened
        ">450",  # in the second half of the year
        "<500",  # the trend is up so assume more of the emissions happened
        ">500",  # in the second half of the year
        "<600",
    ]
    # Or, do we get the user to specify the weights for interpolating, i.e.
    # not our problem?
    exp = []  # something based on weights, e.g.
    weights = [1, 0, 1, 0, 1]  # dump all emissions in first 6 months of year
    exp = [
        900,  # emissions rate needs to double so the integral is preserved
        0,
        1000,
        0,
        1200,
    ]
```
mikapfl commented 1 year ago

Hi Zeb,

The examples make things a lot clearer, yes. Some thoughts:

In practice, we often treat emissions as intensive, i.e. with dimension [mass * substance / time], e.g. unit "Mt CO2 / year". I understand why, but that means we have additional complications with units. When resampling to months, do we change to the even shakier unit "Mt CO2 / month"? Years are a somewhat useful time unit, but month lengths differ by 10%. So, we could keep the unit "Mt CO2 / year" when resampling to months (which means resampling to months is very easy if you assume a constant emissions rate – you don't have to change the value at all, 1 year of 10 Mt CO2 / year equals 12 months of 10 Mt CO2 / year each). But honestly, monthly emissions in Mt CO2 / year will probably confuse people.
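Spelling the bookkeeping out with numbers (a toy sketch that ignores the unequal month lengths mentioned above):

```python
import math

# intensive view: a constant rate of 10 Mt CO2 / year keeps its value
# when resampled to months
rate_mt_per_yr = 10.0
monthly_rates = [rate_mt_per_yr] * 12    # each still 10 Mt CO2 / year

# extensive view: the 10 Mt CO2 emitted over the year has to be split up
amount_mt = 10.0
monthly_amounts = [amount_mt / 12] * 12  # roughly 0.83 Mt CO2 each

# aggregating back up: sum the amounts, average the rates
assert math.isclose(sum(monthly_amounts), amount_mt)
assert math.isclose(sum(monthly_rates) / 12, rate_mt_per_yr)
```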

tl;dr: I think piecewise-constant emission rates with dimension [mass substance / time] are just extensive emissions with dimension [mass substance] in disguise. Extensive quantities can't be integrated (the result would be a quantity with dimension [mass substance time]); they have to be summed. That's why you get wrong results if you integrate piecewise-constant emission rates (neglecting the change in unit). Generalized aggregation (summation or integration) can be done if you know whether your quantity is extensive or intensive.

znicholls commented 1 year ago

Very nice explanation. I am convinced. Next questions:

mikapfl commented 1 year ago

Funnily enough, intensive quantities can actually be properly defined and measured at a single point. E.g. the emissions rate can be defined and measured (in principle) at a specific instant, without a range. Extensive quantities can't; they need a range. Technically, pandas has a time-range type, called a period:

In [1]: import pandas as pd

In [3]: pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
Out[3]: 
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01'],
            dtype='period[M]')

In [5]: p = pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')[0]

In [6]: p.start_time
Out[6]: Timestamp('2017-01-01 00:00:00')

In [7]: p.end_time
Out[7]: Timestamp('2017-01-31 23:59:59.999999999')

In [9]: pd.period_range(start='2017', end='2050', freq='2A')
Out[9]: 
PeriodIndex(['2017', '2019', '2021', '2023', '2025', '2027', '2029', '2031',
             '2033', '2035', '2037', '2039', '2041', '2043', '2045', '2047',
             '2049'],
            dtype='period[2A-DEC]')

In [10]: pd.period_range(start='2017', end='2050', freq='2A')[0].start_time
Out[10]: Timestamp('2017-01-01 00:00:00')

In [11]: pd.period_range(start='2017', end='2050', freq='2A')[0].end_time
Out[11]: Timestamp('2018-12-31 23:59:59.999999999')

I haven't worked much with them, though, so I can't say whether they are worth the trouble of another time datatype.
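If they were adopted, extensive data would map onto them quite naturally. A small sketch (made-up values):

```python
import pandas as pd

# extensive data naturally lives on periods rather than instants
periods = pd.period_range(start="2020", end="2022", freq="A")
emissions = pd.Series([450.0, 500.0, 600.0], index=periods)  # Mt CO2 per year

# each label carries its own start and end, so the "range" is explicit
print(emissions.index[0].start_time, emissions.index[0].end_time)
```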