Closed david-waterworth closed 1 month ago
hey
currently it's just "the same time, rolled forwards to the end of the month". pandas' MonthEnd
does the same thing:
In [4]: from pandas.tseries.offsets import MonthEnd
In [5]: from datetime import datetime
In [6]: datetime(2020, 1, 1, 2) + MonthEnd(1)
Out[6]: Timestamp('2020-01-31 02:00:00')
, though of course this doesn't mean that Polar's can't change, I'd just like to understand the request better
i.e. If you're filtering by events with timestamps based on whether the event occurred in Jan events occurring at midday on Jan 31st should be part of the interval. I ended up having to use .dt.offset_by("1mo") instead to generate the end of month datetime's from the start of month datetimes.
Could you show your code please, just to better understand what you're using month_end
for?
This is not really easy to address. With the (apt) analogy of treating dates as integers and datetimes as reals, in the same way that there is no largest value x
such that x < 1
(i.e. x = 0.9999...), there is no largest datetime d
such that d < Feb 1
.
I've had to instruct people at my company to be wary of using less-than operator when dealing with timestamps, because if you have a timestamp x
and you say "WHERE x) <= 'Jan 31'" then you'll only get up through Jan 30 (unless you happen to have a timestamp at exactly Jan 31 00:00:00).
Probably overkill, but here's a visual of dates versus datetimes:
Easy, each "block" is a single item, so the month end is the last bucket:
| Jan || Feb ||
---------------------------------------------------------------------------------------------
| 1 | 2 | 3 | 4 | ... | 28 | 29 | 30 | 31 || 1 | 2 | 3 | 4 | ... | 25 | 26 | 27 | 28 || 1 | ...
↑ ↑ ↑ ↑
start end start end
Not easy, this is a continuous domain, so the ceil(Jan) == floor(Feb):
| Jan || Feb ||
---------------------------------------------------------------------------------------------
| 1 | 2 | 3 | 4 | ... | 28 | 29 | 30 | 31 || 1 | 2 | 3 | 4 | ... | 25 | 26 | 27 | 28 || 1 | ...
↑ ↑↑ ↑↑
start end==start end==start
The "month end" of January is at midnight when the date changes from Jan 31 to Feb 1, which is at Jan 31 11:59.99999.... There's not really a good way to represent this. In polars, this is a u64
so we could simply represent it as Feb 1st - 1
, but that's an implementation detail.
I vote as keeping it as-is. If the user is using monthend
on datetimes they probably should have thought this through already.
@mcrumiller I'm not sure I agree with your comment that "there is no largest datetime d such that d < Feb 1.". Whilst true in theory (i.e. time is continuous), in practice a datetime is always (afaik) implemented as discrete with a minimum tick size, either as an int (i.e. [ms, us, ns] ticks since some epoch) or struct (year, month, day, hours, mins, seconds, [ms, us, ns]). So given the column type you should be able to work out argmax(d) d < Feb 1
knowing the tick size. But if you assume the interval is open on the right you can just return 1-Feb 00:00:00.000
@MarcoGorelli I was originally trying to add temporal columns for grouping/filtering (i.e. month, quarter etc) and was specifically asked to return [month_start, month_end) columns so they could do further analysis. I got tripped up by polars.Expr.dt.month_start(ts)
/ polars.Expr.dt.month_end(ts)
not working as expected. It doesn't seem logical that the start of the month is a different pl.Datetime for two different timestamps in the same month!
This works
`pl.col("ts")
.dt.convert_time_zone(time_zone=tzinfo.key)
.cast(pl.Date)
.dt.month_start()
But ended up using .group_by_dynamic("ts", every="1mo", group_by="status", include_boundaries=True, check_sorted=True)
- since the user decided they didn't need the high resolution data.
So note, this does actually compute the month boundaries exactly as I wanted. I still believe for consistency polars.Expr.dt.month_start(pl.Datetime)
/ polars.Expr.dt.month_end(pl.Datetime)
should be consistent with .group_by_dynamic("ts", every="1mo", closed="left")
or .group_by_dynamic("ts", every="1mo", closed="both")
or shouldn't accept pl.Datetime at all.
and was specifically asked to return [month_start, month_end) columns so they could do further analysis
If you want the interval open on the right, then once you've got month_start
, I think you should do month_start.dt.offset_by('1mo')
?
or shouldn't accept pl.Datetime at all.
I think you may have a point here. The current month_end
functionality for Datetime
s would be quite easy to recover anyway using combine
if anyone really wants that:
In [42]: df = pl.DataFrame({'a': [datetime(2020, 1, 4, 2)]})
In [43]: df.with_columns(pl.col('a').dt.date().dt.month_end().dt.combine(pl.col('a').dt.time()))
Out[43]:
shape: (1, 1)
┌─────────────────────┐
│ a │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2020-01-31 02:00:00 │
└─────────────────────┘
In [44]: df.with_columns(pl.col('a').dt.month_end())
Out[44]:
shape: (1, 1)
┌─────────────────────┐
│ a │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2020-01-31 02:00:00 │
└─────────────────────┘
and was specifically asked to return [month_start, month_end) columns so they could do further analysis
If you want the interval open on the right, then once you've got
month_start
, I think you should domonth_start.dt.offset_by('1mo')
?
Yeah that's true, you really only need month_start
and offset_by
- similar to Excel EOMONTH(dt,-1)+1
except they only had EMONTH for some reason.
Also if there's someway of getting epsilon
for a date (ie. the tick size, 1e-3, 1e-6 etc) then you could subtract this to get the closed interval. I don't know if this is a good idea (you'd have to very careful, probably not use a floating point) but it has the nice property that
month_end(pl.Date: date) == month_end(pl.DateTime: datetime).dt.date()
are the same? [1]
Either that or support "24-hour clock" format - https://en.wikipedia.org/wiki/24-hour_clock :)
[1] I noticed that .dt.date() != .cast(pl.Date)
if the datetime is localised, the cast appears to be equivalent to .dt.date()
applied to the UTC representation, not the local representation. I wonder if there should be a warning at least if the user tried to cast a localised (non-utc) datetime - on Stack Overflow at least cast
is the most accepted (wrong) solution
to converting a datetime to date ? I can open a new issue if you agree?
Whilst true in theory (i.e. time is continuous), in practice a datetime is always (afaik) implemented as discrete with a minimum tick size
@david-waterworth I addressed this when I mentioned:
In polars, this is a u64 so we could simply represent it as Feb 1st - 1, but that's an implementation detail.
But regardless, having a month_end()
on datetimes would also warrant having a day_end()
as well. These both just feel messy to me, unless we simply provide 12:00am at the start of the next day. In other words, month_end()
is kinda the same as ceil(x)
. And good point Marco about the quarter_end / year_start etc.
@mcrumiller yeah sorry I missed that.
I'd be fine with 12:00am at the start of the first day of the following interval for all {interval}_end(pl.Datetime)
I guess a second question is whether you should overload or have separate/explicit Date and Datetime functions - but I assume there are other instances where Expr.dt overloads so that's probably not something you want to change now.
And is it actually possible (for a python user) to calculate Feb 1st - 1 (time_unit)
at the moment? Using dt.offset_by
for example you'd need to use either -1ms
, -1us
or -1ns
depending on the time_unit
. I don't see a Expr.dt.time_unit
function which would make this easy. I don't think it's important as I usually explicitly set the time_unit when I create the column (or if not I can assume it's ‘ns’). Just an observation.
12:00am at the start of the first day of the following interval
Sorry I still don't see why this is needed over .dt.offset_by(interval)
month_end
was introduced to give an easy way to get the last day of the month. For Date
, I think this needs to stay. For Datetime
, wouldn't it be very surprising to output a different day of the month?
And is it actually possible (for a python user) to calculate Feb 1st - 1 (time_unit) at the moment?
Here's one way:
df.select(pl.col('a').dt.offset_by('1mo').cast(pl.Int64).sub(1).cast(df.schema['a']))
Just wanted to chime in that indeed the existing behavior of keeping the same time indication is strange to me.
I would expect the latest time possible, i.e. 23:59:59.999999
(depending on time unit).
from discussion - returning the last moment of the month would be too error-prone, and this isn't worth changing. am putting up a pr to just document it better
Description
I find the behavior of
polars.Expr.dt.month_end
a little surprising when the argument type isdatetime
.i.e. it's entirely logical/unexpected that
polars.Expr.dt.month_end('1-Jan-2024
) is '31-Jan-2024` since if you're dealing with daily events and something occurs on 31-Jan-2024 then it obviously happened in Jan - i.e. Jan = [1-Jan-2024, 31-Jan-2024]But I don't find the behaviour when the datatype is datetime as logical, i.e.
polars.Expr.dt.month_end('1-Jan-2024 00:00:00.000
) is31-Jan-2024 00:00:00.000
I would argue the that the right boundary January is either
31-Jan-2024 23:59:59.999
or1-Feb-2024 00:00:00.000
- i.e. Jan != [1-Jan-2024 00:00, 31-Jan-2024 00:00], Jan = [1-Jan-2024 00:00, 31-Jan-2024 23:59] or [1-Jan-2024 00:00, 1-Feb-2024 00:00)i.e. If you're filtering by events with timestamps based on whether the event occurred in Jan events occurring at midday on Jan 31st should be part of the interval.
I ended up having to use
.dt.offset_by("1mo")
instead to generate the end of month datetime's from the start of month datetimes.In general I don't think api's should accept date and datetime as they don't represent the same things.
Also I've not looked that closely but the fact that the name is
month_end
implies to me that the result should always be the same time interval on the last day of the month doesn't seem to hold either - i.e. this is the example from the documentation slightly modified:It returns the the last day of the month, but the time is 02:00 which I find a bit unusual - it'll roll datetime(2000, 1, 1, 3) into feb for example.