I think the major issue was that the pandas datetime range was limited to roughly 1760 to 2250 or something, and we needed to be able to handle datetimes outside these dates. Now that pandas supports microsecond precision (I think), we could revisit those restrictions. We have tests for handling datetimes post 2250 and pre 1760, so if changing the back-end leaves the tests passing, then I think we're good.
I looked again, and as far as I can tell, pandas still just breaks on dates outside this range:
In [12]: pd.Timestamp.min
Out[12]: Timestamp('1677-09-21 00:12:43.145224193')
In [13]: pd.Timestamp.max
Out[13]: Timestamp('2262-04-11 23:47:16.854775807')
There is a workaround using periods (e.g. pd.period_range) instead, but it loses functionality, and it also doesn't work for dates before 1 AD.
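For illustration, a quick sketch of both points (exact error behaviour may vary by pandas version):

import pandas as pd

# Periods are not bound by the Timestamp limits above:
pd.period_range(start="1500", end="2500", freq="A")
# PeriodIndex(['1500', '1501', ..., '2500'], dtype='period[A-DEC]')

# But years before 1 AD fail to parse:
pd.period_range(start="-0100", end="0100", freq="A")  # raises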
Now, I also remember that this was one reason to prefer xarray over pandas back when we made that decision for primap2. Okay, not much to do then, I guess; any library building on pandas which needs long-range dates will need to roll its own date handling.
Hmm, sad. I thought there was some pandas update, but it looks like https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution doesn't do the job.
I thought we tried to use as much of the xarray cftime handling as possible; we probably need to revisit that...
Ah, I hadn't seen the new functionality from April. Yes, I think with pandas 2.0, e.g. date_range learned the necessary tricks:
In [8]: pd.date_range(start="-1000-01-01", end="3000-01-01", freq="AS", unit="s")
Out[8]:
DatetimeIndex(['-1000-01-01', '-999-01-01', '-998-01-01', '-997-01-01',
'-996-01-01', '-995-01-01', '-994-01-01', '-993-01-01',
'-992-01-01', '-991-01-01',
...
'2991-01-01', '2992-01-01', '2993-01-01', '2994-01-01',
'2995-01-01', '2996-01-01', '2997-01-01', '2998-01-01',
'2999-01-01', '3000-01-01'],
dtype='datetime64[s]', length=4001, freq='AS-JAN')
This can be used as a full-featured index. So, I think this could be used in scmdata, reducing the amount of time-related code considerably.
Nice find
Any idea if it works with stuff that isn't a range (e.g. some historical data and then only 2100, 2200, 2300, 2400)?
Yes, DatetimeIndex supports any collection of Timestamps:
In [11]: (pd.date_range(start="1700-01-01", end="2021-01-01", freq="AS", unit="s")
...: .union(pd.date_range(start="2100-01-01", end="2400-01-01", periods=4, unit="s")))
Out[11]:
DatetimeIndex(['1700-01-01', '1701-01-01', '1702-01-01', '1703-01-01',
'1704-01-01', '1705-01-01', '1706-01-01', '1707-01-01',
'1708-01-01', '1709-01-01',
...
'2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01',
'2020-01-01', '2021-01-01', '2100-01-01', '2200-01-01',
'2300-01-01', '2400-01-01'],
dtype='datetime64[s]', length=326, freq=None)
Of course, you then lose frequency information on the whole thing because it doesn't have a frequency any more.
Nice
Another reason that fed into why we rolled our own was to be explicit about piecewise-linear (concentrations, ERF) and piecewise-constant (emissions) timeseries and their subtle differences. We never really achieved that goal, but have come back to discussing it every year or so.
While ScmRun.resample is handy for interpolating missing years, it doesn't support extrapolation directly.
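For comparison, a minimal sketch of how linear extrapolation can be done with scipy directly (not scmdata API; the data is made up):

import numpy as np
from scipy.interpolate import interp1d

years = np.array([2000.0, 2010.0, 2020.0])
values = np.array([10.0, 12.0, 15.0])

# fill_value="extrapolate" extends the linear pieces beyond the data range.
f = interp1d(years, values, fill_value="extrapolate")
f([1995.0, 2005.0, 2030.0])  # -> array([ 9., 11., 18.])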
I think the difference between piecewise-linear and piecewise-constant boils down to extensive and intensive quantities, right? Any extensive quantity is piecewise-constant and any intensive quantity is piecewise-linear. If you resample, you have to take sums or do divisions for extensive quantities, and you calculate means or interpolate linearly for intensive quantities. In pandas, that is generalized with the concept of Resampler objects (akin to groupby): resampling just gives you a Resampler object, and you then call sum() or interpolate() on it.
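A minimal sketch of that pattern in plain pandas (the data is made up):

import pandas as pd

# Extensive quantity: Mt CO2 emitted in each month.
months = pd.date_range("2000-01-01", periods=24, freq="MS")
emissions = pd.Series(1.0, index=months)
emissions.resample("AS").sum()  # downsample by summing -> 12.0 per year

# Intensive quantity: a concentration known at two points in time.
conc = pd.Series([368.0, 370.0], index=pd.to_datetime(["2000-01-01", "2001-01-01"]))
conc.resample("QS").interpolate()  # upsample by linear interpolation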
I think for all kinds of higher-level operations, it would be great if we could specify for a timeseries whether it is extensive or intensive and then do the right thing automatically.
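A hypothetical helper to illustrate the idea (the name and flag are made up, not existing scmdata API):

import pandas as pd

def resample_quantity(series: pd.Series, freq: str, *, extensive: bool) -> pd.Series:
    # Extensive quantities get summed, intensive ones interpolated linearly.
    resampler = series.resample(freq)
    return resampler.sum() if extensive else resampler.interpolate()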
I think we're on the same page but just to check:
Probably the simplest way to get on the same page would be to write some concrete tests...?
I think for all kinds of higher-level operations, it would be great if we could specify for a timeseries whether it is extensive or intensive and then do the right thing automatically
100%. If we can get users to tell us how to interpret between the discrete points, then we basically know the entire continuous timeseries, which makes it way easier to work with the data. We once tried to do this but then got caught in a rabbit hole and abandoned ship. Would be a great one to revive one day.
Hi,
I think I don't really understand the problem space yet:
Cheers,
Mika
I think I don't really understand the problem space yet
That makes two of us :)
Some tests below which hopefully make things a bit more concrete.
Hi Zeb,
The examples make it a lot clearer, yes. Some thoughts:
In practice, we often treat emissions as intensive, i.e. with dimension [mass * substance / time], e.g. unit "Mt CO2 / year". I understand why, but that means we have additional complications with units. When resampling to months, do we change to the even shakier unit "Mt CO2 / month"? Years are a somewhat useful time unit, but month lengths differ by 10%. So, we could keep the unit "Mt CO2 / year" when resampling to months (which means resampling to months is very easy if you assume a constant emissions rate – you don't have to change the value at all, 1 year of 10 Mt CO2 / year equals 12 months of 10 Mt CO2 / year each). But honestly, monthly emissions in Mt CO2 / year will probably confuse people.
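A tiny pandas sketch of that last point (made-up data; the unit lives only in the comments):

import pandas as pd

# A constant emissions rate of 10 Mt CO2 / year, reported annually.
rate = pd.Series(
    [10.0, 10.0, 10.0],
    index=pd.to_datetime(["2000-01-01", "2001-01-01", "2002-01-01"]),
)

# Under a piecewise-constant assumption, resampling to months changes
# nothing but the index: every month is still 10 Mt CO2 / year.
rate.resample("MS").ffill()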
tl;dr: I think piecewise-constant emission rates with dimension [mass * substance / time] are just extensive emissions with dimension [mass * substance] in disguise. Extensive quantities can't be integrated (the result would be a quantity with dimension [mass * substance * time]), they have to be summed. That's why you get wrong results if you integrate piecewise-constant emission rates (neglecting the change in unit). Generalized aggregation (summation or integration) can be done if you know whether your quantity is extensive or intensive.
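The dimensional bookkeeping can be checked with pint, for example (default unit registry; the substance dimension is omitted for brevity):

import pint

ureg = pint.UnitRegistry()

# Integrating an intensive rate over time gives back the extensive quantity:
rate = 10.0 * ureg("Mt / year")
(rate * (1 * ureg.year)).to("Mt")  # -> 10.0 megametric_ton

# Integrating the extensive quantity itself would yield [mass * time]:
mass = 10.0 * ureg("Mt")
(mass * (1 * ureg.year)).dimensionality  # -> [mass] * [time]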
Very nice explanation. I am convinced. Next questions:
Funnily enough, intensive quantities can actually be properly defined and measured at a single point. E.g. the emissions rate can be defined and measured (in principle) at a specific instant, without a range. Extensive quantities can't; they need a range. Technically, pandas has time-range types, which it calls periods:
In [1]: import pandas as pd
In [3]: pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
Out[3]:
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
'2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
'2018-01'],
dtype='period[M]')
In [5]: p = pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')[0]
In [6]: p.start_time
Out[6]: Timestamp('2017-01-01 00:00:00')
In [7]: p.end_time
Out[7]: Timestamp('2017-01-31 23:59:59.999999999')
In [9]: pd.period_range(start='2017', end='2050', freq='2A')
Out[9]:
PeriodIndex(['2017', '2019', '2021', '2023', '2025', '2027', '2029', '2031',
'2033', '2035', '2037', '2039', '2041', '2043', '2045', '2047',
'2049'],
dtype='period[2A-DEC]')
In [10]: pd.period_range(start='2017', end='2050', freq='2A')[0].start_time
Out[10]: Timestamp('2017-01-01 00:00:00')
In [11]: pd.period_range(start='2017', end='2050', freq='2A')[0].end_time
Out[11]: Timestamp('2018-12-31 23:59:59.999999999')
I haven't worked much with them, though, so can't say if they are worth the trouble of another time datatype.
Describe the bug

Most of the time handling functionality of scmdata is a duplication of already existing functionality in pandas. Given that pandas is a much more mature project, the scmdata functionality is also severely lacking compared to pandas.

An example is the resampling functionality of pandas. The code for resampling is super concise, and inter- and extrapolation is available. The equivalent of pandas' resampling in scmdata will be much more involved, simply because run.interpolate takes the target_times as a list of time values.

As a result, I found it much easier to convert ScmRun objects into pandas DataFrames with proper pd.DatetimeIndexes when working with even the simplest time-related problems. The question is then: why is there a proprietary re-implementation of pandas' functionality in scmdata at all? Wouldn't it be better to get rid of it and leverage pandas' functionality?
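For concreteness, a minimal sketch of the contrast (made-up data; the scmdata call follows the target_times usage described above and is left commented out):

import datetime as dt
import pandas as pd

df = pd.DataFrame(
    {"value": [1.0, 4.0]},
    index=pd.to_datetime(["2000-01-01", "2004-01-01"]),
)

# pandas: upsampling plus linear interpolation is a one-liner
df.resample("AS").interpolate()

# scmdata: the target times must be spelled out explicitly
target_times = [dt.datetime(year, 1, 1) for year in range(2000, 2005)]
# run.interpolate(target_times)  # `run` would be the equivalent ScmRun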