narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

selectors.by_dtype doesn't select tz-aware columns #1313

Open MarcoGorelli opened 3 weeks ago

MarcoGorelli commented 3 weeks ago
import polars as pl
import pandas as pd
import narwhals as nw
from datetime import datetime

data = {'a': [datetime(2020, 1, 1), datetime(2020, 1, 2)], 'c': [4, 5]}

# pandas-backed: add a tz-aware copy of 'a' as 'b', then select by Datetime
df = nw.from_native(pd.DataFrame(data)).with_columns(b=nw.col('a').dt.replace_time_zone('Asia/Katmandu'))
print(df.select(nw.selectors.by_dtype(nw.Datetime)).to_native())

# Polars-backed: same steps
df = nw.from_native(pl.DataFrame(data)).with_columns(b=nw.col('a').dt.replace_time_zone('Asia/Katmandu'))
print(df.select(nw.selectors.by_dtype(nw.Datetime)).to_native())

This outputs:

           a                         b
0 2020-01-01 2020-01-01 00:00:00+05:45
1 2020-01-02 2020-01-02 00:00:00+05:45
shape: (2, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
│ 2020-01-02 00:00:00 │
└─────────────────────┘

For Polars, both columns 'a' and 'b' should be selected, as they are for pandas.

MarcoGorelli commented 3 weeks ago

Actually, this matches what Polars does with:

df = pl.DataFrame(data).with_columns(b=pl.col('a').dt.replace_time_zone('Asia/Katmandu'))
print(df.select(pl.selectors.by_dtype(pl.Datetime)))

The Polars way to do this would be:

df = pl.DataFrame(data).with_columns(b=pl.col('a').dt.replace_time_zone('Asia/Katmandu'))
print(df.select(pl.selectors.by_dtype(pl.Datetime, pl.Datetime(time_zone='*'))))

tbh I think this is a bit odd in Polars: I'd expect every variation of Datetime to match by_dtype(pl.Datetime). I'll check whether this is actually desirable / correct in Polars before moving forward here.
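
To make the two behaviours concrete, here's a toy model in plain Python. This is only a sketch, not Polars or Narwhals code: the `Datetime` and `Int64` dataclasses and the `select_exact` / `select_by_class` helpers are hypothetical stand-ins. It assumes that "exact" matching compares a dtype including its parameters, while "class-level" matching accepts any parametrisation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datetime:
    """Hypothetical stand-in for a dataframe library's datetime dtype."""
    time_unit: str = "us"
    time_zone: Optional[str] = None


@dataclass(frozen=True)
class Int64:
    """Hypothetical stand-in for an integer dtype."""


# Toy schema mirroring the columns from the example in this issue.
schema = {
    "a": Datetime(),                            # tz-naive
    "b": Datetime(time_zone="Asia/Katmandu"),   # tz-aware
    "c": Int64(),
}


def select_exact(schema, dtype):
    """Keep columns whose dtype equals `dtype`, parameters included."""
    return [name for name, dt in schema.items() if dt == dtype]


def select_by_class(schema, dtype_class):
    """Keep columns whose dtype is any instance of `dtype_class`."""
    return [name for name, dt in schema.items() if isinstance(dt, dtype_class)]


exact = select_exact(schema, Datetime())
by_class = select_by_class(schema, Datetime)
print(exact)     # ['a']
print(by_class)  # ['a', 'b']
```

In this toy model, `select_exact` behaves like the Polars output above (only 'a'), and `select_by_class` is what I'd have expected from by_dtype with a bare Datetime (both 'a' and 'b').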