pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.41k stars 1.97k forks source link

Specialized `convert_and_replace_timezone` #12761

Closed ion-elgreco closed 10 months ago

ion-elgreco commented 11 months ago

Description

Let's say we have a dataframe with two columns. The current date in UTC and the local timezone. Currently it's not that easy to convert it first the local timezone and then replace the timezone with None without using pl.when.then with a filter.

import polars as pl
from datetime import datetime
import pytz

df = pl.DataFrame({
    'id': [1,2],
    "date_col": [datetime(2020,10,10, tzinfo=pytz.UTC), datetime(2020,10,10, tzinfo=pytz.UTC)],
    'timezone': ['Europe/London', 'Africa/Kigali']
})

exprs =[
    pl.when(pl.col('timezone')==tz)
    .then(pl.col('date_col').dt.convert_time_zone(tz).dt.replace_time_zone(None))  
    for tz in ['Europe/London', 'Africa/Kigali']
    ]

df.with_columns(
    pl.coalesce(exprs)
)

It would be nice if we could do this in one go:

df.select(local_date=pl.col('date_col').dt.convert_and_replace_timezone(convert=pl.col('timezone'), replace=None).

This way the end result is in one datetime and tz format but the intermediate step can hold different tz's.

@MarcoGorelli

MarcoGorelli commented 11 months ago

so starting from a tz-aware datetime, and a column of time zones timezone, you'd like to get the local datetime for each row according to the time zone in the timezone column

not sure about the name, but a useful way to do this does seem useful. I don't think there's currently any way to do it lazily

ion-elgreco commented 11 months ago

@MarcoGorelli correct!

Yeah this pl.when.then trick works only if you know the timezones that the data will contain ahead.

MarcoGorelli commented 11 months ago

Got a suggestion from @orlp (which I think looks really good):

I think this should be a pair of functions some variation of to_naive and from_naive

  • Expr.dt.to_naive(tz: pl.Expr) would return the local time as per the given timezone for each column, resulting in a pl.DateTime(..., time_zone=None)
  • Expr.dt.from_naive(naive_tz: pl.Expr, out_tz: str) would accept a column of datetimes with a None timezone, where each timestamp is assumed to be in the format by naive_tz and converted to a pl.DateTime(..., time_zone=out_tz)

i.e.:

new_column = pl.col('ts').dt.to_naive('time_zone')

ts                | time_zone | new_column
---------------------------------------------------------------------
datetime[us, UTC] | str       | datetime[us]
2020-01-01 +0200  | Amsterdam | 2019-12-31 23:00:00
2020-01-01 +0300  | Kathmandu | 2020-01-01 02:45:00

and

new_column = pl.col('ts').dt.from_naive('time_zone', 'UTC')

ts                  | time_zone | new_column
---------------------------------------------------------------------
datetime[us]        | str       | datetime[us, UTC]
2019-12-31 23:00:00 | Amsterdam | 2020-01-01 +0200
2020-01-01 02:45:00 | Kathmandu | 2020-01-01 +0200
ion-elgreco commented 11 months ago

@MarcoGorelli ah nice one, then we go both directions. Alright, I will add the other one as well.

I still find naive not saying much imho then I think your naive_local is more clear than if we want to mention naive

MarcoGorelli commented 11 months ago

I still find naive not saying much imho then I think your naive_local is more clear than if we want to mention naive

I think to_local_datetime is fine too, the comment was just to demonstrate the pair of functions

I'd suggest naming it like that if you prefer it, we can always do a final function name bikeshed before shipping it 😄

ion-elgreco commented 11 months ago

@MarcoGorelli is it okay to change the python expr naming only or should it be consistent across the rust and python API?

MarcoGorelli commented 11 months ago

probably best to be consistent

stinodego commented 10 months ago

I like the functionality this adds, but it also overlaps with the existing functionality, which can be confusing.

to_naive(None) would be the same as replace_time_zone(None). And from_naive(None, tz) would be the same as replace_time_zone(tz).

With regards to naming, maybe it can be called remove_time_zone/add_time_zone. That makes it clear it's about time zones and that it the input/output will be tz-aware/tz-naive or vice versa. The additional time zone expression can me thought of as "according to this time zone". Also gels well with the existing methods.

ion-elgreco commented 10 months ago

@stinodego I would say that's a small overlap :) Both target different use cases

Regarding the naming, we are not only removing or adding a timezone. We rather do convert to TZ and then replace TZ with None, and the other way around. Only saying remove or add doesn't make that entirely clear and could give different impressions of what it does.

From all the options my preference would go to the to/from_naive_local_datetime

stinodego commented 10 months ago

@MarcoGorelli any input on the naming of these functions? I find it hard to decide. In your presentation at pydata, you made a crystal clear case for when to use replace_time_zone vs convert_time zone. I feel like adding these two new methods makes things a bit muddy.

I'd like to look at the 4 method names and know what they do immediately. Maybe they shouldn't be separate methods but parameters instead? I'm not sure. Input is appreciated.

MarcoGorelli commented 10 months ago

I'm honestly not sure - you're right that it's gone from "2 clear and distinct functions" to "4 functions with some overlap"

I'm tempted to say that there two new ones could be in a Polars plugin instead, so that Polars only has the most commonly used time zone functions (and they can remain simple and clearly named), and the relatively less common case of converting from multiple time zones can be handled there?

Maybe I could rename polars-business to polars-tse (Polars Time Series Extras) and we put this in?

ion-elgreco commented 10 months ago

@MarcoGorelli happy to migrate it to your plugin

stinodego commented 10 months ago

Thanks guys. I'll be closing this then!

This functionality will be available in https://github.com/MarcoGorelli/polars-xdt (see PR https://github.com/MarcoGorelli/polars-xdt/pull/42)