openscm / scmdata

Handling of Simple Climate Model data
https://scmdata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

With pandas 2.0, custom time handling might not be necessary any more #242

Open mikapfl opened 1 year ago

mikapfl commented 1 year ago

Describe the bug

Most of the time-handling functionality in scmdata duplicates functionality that already exists in pandas. Given that pandas is a much more mature project, the scmdata implementation also lags well behind it.

An example is the resampling functionality of pandas. The code for resampling is super concise, and both interpolation and extrapolation are available. The equivalent of pandas'

ts.resample("5Min").sum()

in scmdata will be much more involved, simply because run.interpolate takes the target_times as a list of time values.

As a result, I found it much easier to convert ScmRun objects into pandas DataFrames with a proper pd.DatetimeIndex when working with even the simplest time-related problems. The question then is: why is there a custom re-implementation of pandas' functionality in scmdata at all? Wouldn't it be better to get rid of it and leverage pandas' functionality directly?
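To make the comparison concrete, here is a rough sketch. The pandas half is real code; the `run.interpolate` call is only indicative of the shape of the scmdata API described above, not an exact signature.

```python
import datetime

import pandas as pd

# pandas: the target grid is implied by the frequency string
ts = pd.Series(
    [1.0, 3.0, 2.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-10", "2000-01-20"]),
)
resampled = ts.resample("5D").interpolate()

# scmdata-style: the caller has to spell out every target time explicitly
# (indicative only; `run` would be an ScmRun holding the same data)
target_times = [
    datetime.datetime(2000, 1, 1) + datetime.timedelta(days=5 * i)
    for i in range(4)
]
# interpolated = run.interpolate(target_times)
```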

znicholls commented 1 year ago

I think the major issue was that pandas' supported datetime range was limited to roughly 1760 to 2250 or something, and we needed to be able to handle datetimes outside those dates. Now that pandas supports microsecond precision (I think), we could revisit those restrictions. We have tests for handling datetimes post-2250 and pre-1760, so if changing the back-end leaves the tests passing then I think we're good.

mikapfl commented 1 year ago

I looked again and, as far as I can tell, pandas still just breaks on dates outside this range:

In [12]: pd.Timestamp.min
Out[12]: Timestamp('1677-09-21 00:12:43.145224193')

In [13]: pd.Timestamp.max
Out[13]: Timestamp('2262-04-11 23:47:16.854775807')

There is a workaround using periods (e.g. pd.period_range) instead, but it loses functionality and also doesn't work for dates before 1 AD.

Now I also remember that this was one reason to prefer xarray over pandas when we made the choice for primap2. Okay, not much to do then; I guess any library building on pandas which needs long-range dates will have to roll its own date handling.

znicholls commented 1 year ago

Hmm, sad. I thought there was some pandas update, but it looks like https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution doesn't do the job.

I thought we tried to use as much of xarray's cftime handling as possible; we probably need to revisit that...


mikapfl commented 1 year ago

Ah, I hadn't seen the new functionality from April. Yes, I think with pandas 2.0, e.g. date_range learned the necessary tricks:

In [8]: pd.date_range(start="-1000-01-01", end="3000-01-01", freq="AS", unit="s")
Out[8]: 
DatetimeIndex(['-1000-01-01',  '-999-01-01',  '-998-01-01',  '-997-01-01',
                '-996-01-01',  '-995-01-01',  '-994-01-01',  '-993-01-01',
                '-992-01-01',  '-991-01-01',
               ...
                '2991-01-01',  '2992-01-01',  '2993-01-01',  '2994-01-01',
                '2995-01-01',  '2996-01-01',  '2997-01-01',  '2998-01-01',
                '2999-01-01',  '3000-01-01'],
              dtype='datetime64[s]', length=4001, freq='AS-JAN')

This can be used as a full-featured index. So, I think this could be used in scmdata, reducing the amount of time-related code considerably.
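For instance, a small sketch (assuming pandas >= 2.0; the column name and values are made up) showing that the usual index machinery works on such an index:

```python
import numpy as np
import pandas as pd

# second-resolution index spanning far beyond the old nanosecond-only bounds
idx = pd.date_range(start="-1000-01-01", end="3000-01-01", freq="AS", unit="s")
df = pd.DataFrame({"value": np.linspace(0.0, 1.0, len(idx))}, index=idx)

# ordinary selection by index attributes works as usual
pre_industrial = df[df.index.year < 1750]
every_century = df[df.index.year % 100 == 0]
```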

znicholls commented 1 year ago

Nice find


znicholls commented 1 year ago

Any idea if it works with stuff that isn't a range (e.g. some historical data and then only 2100, 2200, 2300, 2400)?


mikapfl commented 1 year ago

Yes, DatetimeIndex supports any collection of Timestamps:

In [11]: (pd.date_range(start="1700-01-01", end="2021-01-01", freq="AS", unit="s")
    ...: .union(pd.date_range(start="2100-01-01", end="2400-01-01", periods=4, unit="s")))
Out[11]: 
DatetimeIndex(['1700-01-01', '1701-01-01', '1702-01-01', '1703-01-01',
               '1704-01-01', '1705-01-01', '1706-01-01', '1707-01-01',
               '1708-01-01', '1709-01-01',
               ...
               '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01',
               '2020-01-01', '2021-01-01', '2100-01-01', '2200-01-01',
               '2300-01-01', '2400-01-01'],
              dtype='datetime64[s]', length=326, freq=None)

Of course, you then lose frequency information on the whole thing because it doesn't have a regular frequency any more.
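A quick way to see the loss (sketch, pandas >= 2.0; printed values depend slightly on the pandas version):

```python
import pandas as pd

regular = pd.date_range(start="1700-01-01", end="2021-01-01", freq="AS", unit="s")
sparse = pd.date_range(start="2100-01-01", end="2400-01-01", periods=4, unit="s")

combined = regular.union(sparse)

print(regular.freq)   # an annual offset, set by date_range
print(combined.freq)  # None: the union is irregular, so no frequency survives
```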

znicholls commented 1 year ago

Nice


lewisjared commented 1 year ago

Another reason that fed into why we rolled our own was to be explicit about piecewise-linear (concentrations, ERF) and piecewise-constant (emissions) timeseries and their subtle differences. We never really achieved that goal, but have come back to discussing it every year or so.

While ScmRun.resample is handy for interpolating missing years, it doesn't support extrapolation directly.
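As a sketch of that distinction (illustrative values, not scmdata code): the two conventions give different answers for the same target times.

```python
import numpy as np

years = np.array([2010, 2015, 2020])
conc = np.array([400.0, 450.0, 600.0])  # e.g. concentrations: piecewise-linear
emis = np.array([450.0, 500.0, 600.0])  # e.g. emissions: piecewise-constant

targets = np.array([2012, 2017])

# piecewise-linear: draw straight lines between the reported points
conc_at_targets = np.interp(targets, years, conc)

# piecewise-constant: hold each reported value until the next time point
emis_at_targets = emis[np.searchsorted(years, targets, side="right") - 1]
```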

mikapfl commented 1 year ago

I think the difference between piecewise-linear and piecewise-constant boils down to extensive and intensive quantities, right? Any extensive quantity is piecewise constant and any intensive quantity is piecewise linear. If you resample, you have to take sums or do divisions for extensive quantities, and you calculate means or interpolate linearly for intensive quantities. In pandas, this is generalized with the concept of Resampler objects (akin to groupby): resampling just gives you a Resampler object, and you then call sum() or interpolate() on it.
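A minimal sketch of that pattern (made-up values, frequency aliases as of pandas 2.x):

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=4, freq="AS")

emissions = pd.Series([450.0, 500.0, 600.0, 650.0], index=idx)       # extensive
concentrations = pd.Series([410.0, 412.0, 415.0, 419.0], index=idx)  # intensive

# extensive quantity: aggregate to a coarser grid by summing each period
emissions_2yearly = emissions.resample("2AS").sum()

# intensive quantity: refine to a finer grid and interpolate linearly
concentrations_quarterly = concentrations.resample("QS").interpolate()
```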

I think for all kinds of higher-level operations, it would be great if we could specify whether a timeseries is extensive or intensive and then do the right thing automatically.

znicholls commented 1 year ago

I think we're on the same page but just to check:

Probably the simplest way to get on the same page would be to write some concrete tests..?

I think for all kind of higher-level operations, it would be great if we could specify for a timeseries if it is extensive or intensive and then do the right thing automatically

100%. If we can get users to tell us how to interpret between the discrete points, then we basically know the entire continuous timeseries, which makes it way easier to work with the data. We once tried to do this but then got caught in a rabbit hole and abandoned ship. Would be a great one to revive one day.

mikapfl commented 1 year ago

Hi,

I think I don't really understand the problem space yet:

Cheers,

Mika

znicholls commented 1 year ago

I think I don't really understand the problem space yet

That makes two of us :)

Some tests below which hopefully make things a bit more concrete

Two concrete tests:

```python
# units are a headache I haven't included at all below


def test_simple_interpolation():
    # we have some numbers where we just want the result to be basic
    # interpolation, e.g. concentrations.
    # let's pretend that all the time numbers below are seconds so we can
    # ignore the 'is it start of year or end of year' etc. headaches (which
    # are mostly dealt with in scmdata by using datetimes internally)
    in_times = [2010, 2015, 2020]
    in_dat = [400, 450, 600]
    out_times = [2010, 2013, 2017, 2020]
    exp_out_values = [
        400,  # same time point hence unchanged
        400 + 50 * (2013 - 2010) / (2015 - 2010),  # interpolated
        450 + 150 * (2017 - 2015) / (2020 - 2015),  # interpolated
        600,  # same time point hence unchanged
    ]


def test_piecewise_constant_integral():
    # now we have some numbers that are reported emissions. As far as I can
    # tell, the convention with reported emissions is for them to be 'the sum
    # over the timestep', e.g. if Australia reports its emissions as 450 MtCO2
    # in 2020, 500 MtCO2 in 2021 and 600 MtCO2 in 2022, then this means that
    # 450 MtCO2 was emitted in Australia over the year 2020, 500 MtCO2 in 2021
    # and 600 MtCO2 in 2022. As a result, from 2020 to 2022, 1550 MtCO2 was
    # emitted. The only way for this to be the case if we integrate is if we
    # assume that the data is piecewise-constant. If we put the data on a
    # [2020, 2021, 2022] grid and then assume that the data is
    # piecewise-linear, when we integrate we wouldn't get total emissions of
    # 1550 MtCO2, we would get total emissions greater than that (because
    # drawing a straight line between the points leads to higher emissions
    # than using a piecewise-constant assumption in this case where emissions
    # are rising).
    # (Having written this, in such a case all you really know is the
    # cumulative emissions at different points in time. To get back to
    # emissions, you have to make some assumption. If you interpolate linearly
    # in cumulative emissions space, you get discontinuities in emissions
    # space. If you do e.g. a cubic spline interpolation in cumulative
    # emissions space, you get smooth emissions but they can look pretty
    # weird. If you want some seasonal cycle in emissions space, that would
    # have some implication that I can't think through for how you interpolate
    # in cumulative emissions space.)
    # All of that mumbo jumbo hopefully explains the test below
    in_times = [2020, 2021, 2022]
    in_dat = [450, 500, 600]

    # somehow the integral calculation needs to know to treat the data as
    # piecewise constant when integrating. You can make it the user's problem
    # (which is our current solution because we have both a cumsum and an
    # integrate method on ScmRun), but I think we're trying to go a step
    # further than that here?
    integral = some_integral_calculation()
    assert integral == 1550

    # the next tricky thing with data like this is what to do when
    # resampling or interpolating.
    # If we sample up, we probably need a sum
    out_times = [2020, 2022]
    # Now our time steps are two years long, it is complicated to know what we
    # should put in as expected output
    exp = [
        # spread the cumulative emissions out over two years so the
        # piecewise-constant assumption continues to hold?
        (450 + 500) / 2,
        # What do we do for the 2022 time point? We don't have an end year
        # that specifies the end of the time step so it is hard to know what
        # we should put here, do we just assume constant or do we do some
        # fancy interpolation to try and get the data trend reflected too so
        # this should be slightly higher than 600?
        600,
    ]

    # When interpolating, there is similar ambiguity I think
    # (I think the same questions apply whether we're thinking about within a
    # year or between years, e.g. interpolating from 5-year time steps to
    # annual, but maybe I am missing something)
    out_times = [2020, 2020.5, 2021, 2021.5, 2022]
    # Do we just assume constant within the time step?
    exp = [
        450,
        450,
        500,
        500,
        600,  # similar problem with not having an end bound as above
    ]
    # Do we do some integral-preserving linear interpolation so it is
    # something like the below? (pseudo-values: "<450" means "a bit below 450")
    exp = [
        "<450",  # the trend is up so assume more of the emissions happened
        ">450",  # in the second half of the year
        "<500",  # the trend is up so assume more of the emissions happened
        ">500",  # in the second half of the year
        "<600",
    ]
    # Or, do we get the user to specify the weights for interpolating, i.e.
    # not our problem?
    exp = []  # something based on weights, e.g.
    weights = [1, 0, 1, 0, 1]  # dump all emissions in first 6 months of year
    exp = [
        900,  # emissions rate needs to double so the integral is preserved
        0,
        1000,
        0,
        1200,
    ]
```
mikapfl commented 1 year ago

Hi Zeb,

The examples make things a lot clearer, yes. Some thoughts:

In practice, we often treat emissions as intensive, i.e. with dimension [mass * substance / time], e.g. unit "Mt CO2 / year". I understand why, but that means we have additional complications with units. When resampling to months, do we change to the even shakier unit "Mt CO2 / month"? Years are a somewhat useful time unit, but month lengths differ by 10%. So, we could keep the unit "Mt CO2 / year" when resampling to months (which means resampling to months is very easy if you assume a constant emissions rate – you don't have to change the value at all, 1 year of 10 Mt CO2 / year equals 12 months of 10 Mt CO2 / year each). But honestly, monthly emissions in Mt CO2 / year will probably confuse people.
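Spelling the bookkeeping out with numbers (a toy sketch that ignores the unequal month lengths mentioned above):

```python
import math

# intensive view: a constant rate of 10 Mt CO2 / year keeps its value
# when resampled to months
rate_mt_per_yr = 10.0
monthly_rates = [rate_mt_per_yr] * 12    # each still 10 Mt CO2 / year

# extensive view: the 10 Mt CO2 emitted over the year has to be split up
amount_mt = 10.0
monthly_amounts = [amount_mt / 12] * 12  # roughly 0.83 Mt CO2 each

# aggregating back up: sum the amounts, average the rates
assert math.isclose(sum(monthly_amounts), amount_mt)
assert math.isclose(sum(monthly_rates) / 12, rate_mt_per_yr)
```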

tl;dr: I think piecewise-constant emission rates with dimension [mass substance / time] are just extensive emissions with dimension [mass substance] in disguise. Extensive quantities can't be integrated (the result would be a quantity with dimension [mass substance time]); they have to be summed. That's why you get wrong results if you integrate piecewise-constant emission rates (neglecting the change in unit). Generalized aggregation (summation or integration) can be done if you know whether your quantity is extensive or intensive.

znicholls commented 1 year ago

Very nice explanation. I am convinced. Next questions:

mikapfl commented 1 year ago

Funnily enough, intensive quantities can actually be properly defined and measured at a single point. E.g. the emissions rate can be defined and measured (in principle) at a specific instant, without a range. Extensive quantities can't; they need a range. Technically, pandas has a time-range type, called a period:

In [1]: import pandas as pd

In [3]: pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
Out[3]: 
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01'],
            dtype='period[M]')

In [5]: p = pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')[0]

In [6]: p.start_time
Out[6]: Timestamp('2017-01-01 00:00:00')

In [7]: p.end_time
Out[7]: Timestamp('2017-01-31 23:59:59.999999999')

In [9]: pd.period_range(start='2017', end='2050', freq='2A')
Out[9]: 
PeriodIndex(['2017', '2019', '2021', '2023', '2025', '2027', '2029', '2031',
             '2033', '2035', '2037', '2039', '2041', '2043', '2045', '2047',
             '2049'],
            dtype='period[2A-DEC]')

In [10]: pd.period_range(start='2017', end='2050', freq='2A')[0].start_time
Out[10]: Timestamp('2017-01-01 00:00:00')

In [11]: pd.period_range(start='2017', end='2050', freq='2A')[0].end_time
Out[11]: Timestamp('2018-12-31 23:59:59.999999999')

I haven't worked much with them, though, so I can't say whether they are worth the trouble of another time datatype.
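If they were adopted, extensive data would map onto them quite naturally. A small sketch (made-up values):

```python
import pandas as pd

# extensive data naturally lives on periods rather than instants
periods = pd.period_range(start="2020", end="2022", freq="A")
emissions = pd.Series([450.0, 500.0, 600.0], index=periods)  # Mt CO2 per year

# each label carries its own start and end, so the "range" is explicit
print(emissions.index[0].start_time, emissions.index[0].end_time)
```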