pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.25k stars 1.67k forks source link

polars.Expr.dt.month_end vs polars.Expr.dt.offset_by #16094

Closed david-waterworth closed 1 month ago

david-waterworth commented 1 month ago

Description

I find the behavior of polars.Expr.dt.month_end a little surprising when the argument type is datetime.

i.e. it's entirely logical/unexpected that polars.Expr.dt.month_end('1-Jan-2024) is '31-Jan-2024` since if you're dealing with daily events and something occurs on 31-Jan-2024 then it obviously happened in Jan - i.e. Jan = [1-Jan-2024, 31-Jan-2024]

But I don't find the behaviour when the datatype is datetime as logical, i.e.

polars.Expr.dt.month_end('1-Jan-2024 00:00:00.000) is 31-Jan-2024 00:00:00.000

I would argue the that the right boundary January is either 31-Jan-2024 23:59:59.999 or 1-Feb-2024 00:00:00.000 - i.e. Jan != [1-Jan-2024 00:00, 31-Jan-2024 00:00], Jan = [1-Jan-2024 00:00, 31-Jan-2024 23:59] or [1-Jan-2024 00:00, 1-Feb-2024 00:00)

i.e. If you're filtering by events with timestamps based on whether the event occurred in Jan events occurring at midday on Jan 31st should be part of the interval.

I ended up having to use .dt.offset_by("1mo") instead to generate the end of month datetime's from the start of month datetimes.

In general I don't think api's should accept date and datetime as they don't represent the same things.

Also I've not looked that closely but the fact that the name is month_end implies to me that the result should always be the same time interval on the last day of the month doesn't seem to hold either - i.e. this is the example from the documentation slightly modified:

from datetime import datetime

df = pl.DataFrame(
    {
        "dates": pl.datetime_range(
            datetime(2000, 1, 1, 2),
            datetime(2000, 12, 1, 2),
            "1d",
            eager=True,
        )
    }
)
df.select(pl.col("dates").dt.month_end())

It returns the the last day of the month, but the time is 02:00 which I find a bit unusual - it'll roll datetime(2000, 1, 1, 3) into feb for example.

MarcoGorelli commented 1 month ago

hey

currently it's just "the same time, rolled forwards to the end of the month". pandas' MonthEnd does the same thing:

In [4]: from pandas.tseries.offsets import MonthEnd

In [5]: from datetime import datetime

In [6]: datetime(2020, 1, 1, 2) + MonthEnd(1)
Out[6]: Timestamp('2020-01-31 02:00:00')

, though of course this doesn't mean that Polar's can't change, I'd just like to understand the request better

i.e. If you're filtering by events with timestamps based on whether the event occurred in Jan events occurring at midday on Jan 31st should be part of the interval. I ended up having to use .dt.offset_by("1mo") instead to generate the end of month datetime's from the start of month datetimes.

Could you show your code please, just to better understand what you're using month_end for?

mcrumiller commented 1 month ago

This is not really easy to address. With the (apt) analogy of treating dates as integers and datetimes as reals, in the same way that there is no largest value x such that x < 1 (i.e. x = 0.9999...), there is no largest datetime d such that d < Feb 1.

I've had to instruct people at my company to be wary of using less-than operator when dealing with timestamps, because if you have a timestamp x and you say "WHERE x) <= 'Jan 31'" then you'll only get up through Jan 30 (unless you happen to have a timestamp at exactly Jan 31 00:00:00).

Probably overkill, but here's a visual of dates versus datetimes:

Dates

Easy, each "block" is a single item, so the month end is the last bucket:

|                 Jan                     ||                Feb                      ||
---------------------------------------------------------------------------------------------
| 1 | 2 | 3 | 4 | ... | 28 | 29 | 30 | 31 || 1 | 2 | 3 | 4 | ... | 25 | 26 | 27 | 28 || 1 | ...
  ↑                                     ↑    ↑                                     ↑
start                                  end start                                  end

Datetimes

Not easy, this is a continuous domain, so the ceil(Jan) == floor(Feb):

|                 Jan                     ||                 Feb                     ||
---------------------------------------------------------------------------------------------
| 1 | 2 | 3 | 4 | ... | 28 | 29 | 30 | 31 || 1 | 2 | 3 | 4 | ... | 25 | 26 | 27 | 28 || 1 | ...
↑                                         ↑↑                                         ↑↑
start                                  end==start                                 end==start

The "month end" of January is at midnight when the date changes from Jan 31 to Feb 1, which is at Jan 31 11:59.99999.... There's not really a good way to represent this. In polars, this is a u64 so we could simply represent it as Feb 1st - 1, but that's an implementation detail.

I vote as keeping it as-is. If the user is using monthend on datetimes they probably should have thought this through already.

david-waterworth commented 1 month ago

@mcrumiller I'm not sure I agree with your comment that "there is no largest datetime d such that d < Feb 1.". Whilst true in theory (i.e. time is continuous), in practice a datetime is always (afaik) implemented as discrete with a minimum tick size, either as an int (i.e. [ms, us, ns] ticks since some epoch) or struct (year, month, day, hours, mins, seconds, [ms, us, ns]). So given the column type you should be able to work out argmax(d) d < Feb 1 knowing the tick size. But if you assume the interval is open on the right you can just return 1-Feb 00:00:00.000

@MarcoGorelli I was originally trying to add temporal columns for grouping/filtering (i.e. month, quarter etc) and was specifically asked to return [month_start, month_end) columns so they could do further analysis. I got tripped up by polars.Expr.dt.month_start(ts) / polars.Expr.dt.month_end(ts) not working as expected. It doesn't seem logical that the start of the month is a different pl.Datetime for two different timestamps in the same month!

This works

 `pl.col("ts")
    .dt.convert_time_zone(time_zone=tzinfo.key)
    .cast(pl.Date)
    .dt.month_start()

But ended up using .group_by_dynamic("ts", every="1mo", group_by="status", include_boundaries=True, check_sorted=True) - since the user decided they didn't need the high resolution data.

So note, this does actually compute the month boundaries exactly as I wanted. I still believe for consistency polars.Expr.dt.month_start(pl.Datetime) / polars.Expr.dt.month_end(pl.Datetime) should be consistent with .group_by_dynamic("ts", every="1mo", closed="left") or .group_by_dynamic("ts", every="1mo", closed="both") or shouldn't accept pl.Datetime at all.

MarcoGorelli commented 1 month ago

and was specifically asked to return [month_start, month_end) columns so they could do further analysis

If you want the interval open on the right, then once you've got month_start, I think you should do month_start.dt.offset_by('1mo')?

or shouldn't accept pl.Datetime at all.

I think you may have a point here. The current month_end functionality for Datetimes would be quite easy to recover anyway using combine if anyone really wants that:

In [42]: df = pl.DataFrame({'a': [datetime(2020, 1, 4, 2)]})

In [43]: df.with_columns(pl.col('a').dt.date().dt.month_end().dt.combine(pl.col('a').dt.time()))
Out[43]:
shape: (1, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-31 02:00:00 │
└─────────────────────┘

In [44]: df.with_columns(pl.col('a').dt.month_end())
Out[44]:
shape: (1, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-31 02:00:00 │
└─────────────────────┘
david-waterworth commented 1 month ago

and was specifically asked to return [month_start, month_end) columns so they could do further analysis

If you want the interval open on the right, then once you've got month_start, I think you should do month_start.dt.offset_by('1mo')?

Yeah that's true, you really only need month_start and offset_by - similar to Excel EOMONTH(dt,-1)+1 except they only had EMONTH for some reason.

Also if there's someway of getting epsilon for a date (ie. the tick size, 1e-3, 1e-6 etc) then you could subtract this to get the closed interval. I don't know if this is a good idea (you'd have to very careful, probably not use a floating point) but it has the nice property that

month_end(pl.Date: date) == month_end(pl.DateTime: datetime).dt.date() are the same? [1]

Either that or support "24-hour clock" format - https://en.wikipedia.org/wiki/24-hour_clock :)

[1] I noticed that .dt.date() != .cast(pl.Date) if the datetime is localised, the cast appears to be equivalent to .dt.date() applied to the UTC representation, not the local representation. I wonder if there should be a warning at least if the user tried to cast a localised (non-utc) datetime - on Stack Overflow at least cast is the most accepted (wrong) solution to converting a datetime to date ? I can open a new issue if you agree?

mcrumiller commented 1 month ago

Whilst true in theory (i.e. time is continuous), in practice a datetime is always (afaik) implemented as discrete with a minimum tick size

@david-waterworth I addressed this when I mentioned:

In polars, this is a u64 so we could simply represent it as Feb 1st - 1, but that's an implementation detail.

But regardless, having a month_end() on datetimes would also warrant having a day_end() as well. These both just feel messy to me, unless we simply provide 12:00am at the start of the next day. In other words, month_end() is kinda the same as ceil(x). And good point Marco about the quarter_end / year_start etc.

david-waterworth commented 1 month ago

@mcrumiller yeah sorry I missed that.

I'd be fine with 12:00am at the start of the first day of the following interval for all {interval}_end(pl.Datetime)

I guess a second question is whether you should overload or have separate/explicit Date and Datetime functions - but I assume there are other instances where Expr.dt overloads so that's probably not something you want to change now.

And is it actually possible (for a python user) to calculate Feb 1st - 1 (time_unit) at the moment? Using dt.offset_by for example you'd need to use either -1ms, -1us or -1ns depending on the time_unit. I don't see a Expr.dt.time_unit function which would make this easy. I don't think it's important as I usually explicitly set the time_unit when I create the column (or if not I can assume it's ‘ns’). Just an observation.

MarcoGorelli commented 1 month ago

12:00am at the start of the first day of the following interval

Sorry I still don't see why this is needed over .dt.offset_by(interval)

month_end was introduced to give an easy way to get the last day of the month. For Date, I think this needs to stay. For Datetime, wouldn't it be very surprising to output a different day of the month?

And is it actually possible (for a python user) to calculate Feb 1st - 1 (time_unit) at the moment?

Here's one way:

df.select(pl.col('a').dt.offset_by('1mo').cast(pl.Int64).sub(1).cast(df.schema['a']))
stinodego commented 1 month ago

Just wanted to chime in that indeed the existing behavior of keeping the same time indication is strange to me.

I would expect the latest time possible, i.e. 23:59:59.999999 (depending on time unit).

MarcoGorelli commented 1 month ago

from discussion - returning the last moment of the month would be too error-prone, and this isn't worth changing. am putting up a pr to just document it better