vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Polars Dataframe with "date" dtype would not be rendered by Altair. Two workarounds (with and without pandas) are proposed. #3280

Closed ariewjy closed 8 months ago

ariewjy commented 11 months ago

An Altair (version 5.2.0) chart will not render if any column of a Polars dataframe has the "date" temporal data type.

I found that whenever the Polars dataframe contains a column with the date data type, Altair does not render the chart. The chart only renders when there is no date column — either because it was removed entirely, or because the dataframe was converted to pandas or the column was cast to a different data type, as shown below.

Code to reproduce issue

import polars as pl
import altair as alt

# Import data from vega-datasets via URL
dfx = pl.read_csv(
    "https://raw.githubusercontent.com/vega/vega-datasets/main/data/seattle-weather.csv",
    try_parse_dates=True,
)

dfx

[screenshot: dfx preview with the date column parsed as dtype date]

As shown above, with try_parse_dates=True the date column is parsed with the "date" data type.

With Altair 5.2.0, plotting this with a rect mark produces no rendered output.

import altair as alt

alt.Chart(dfx).mark_rect().encode(
    alt.X("date(date):O").axis(labelAngle=0, format="%e").title("Day"),
    alt.Y("month(date):O").title("Month"),
    alt.Color("max(temp_max):Q", scale=alt.Scale(scheme="redblue")).title("Max Temp"),
)

[screenshot: chart area renders blank]

There are two workarounds to avoid this issue:

1. Convert to a pandas dataframe using to_pandas() before passing it to Altair.

This means Altair will consume the familiar format (a pandas dataframe) instead of Polars; however, we then lose the benefits of a Polars dataframe later on (e.g. speed).

This works best if we are not planning to do any further data wrangling with Polars.

dfx = (
    pl.read_csv(
        "https://raw.githubusercontent.com/vega/vega-datasets/main/data/seattle-weather.csv",
        try_parse_dates=True
    )
    .to_pandas()
)

dfx

[screenshot: pandas dataframe preview of dfx]

The same charting code now works:

[screenshot: rendered heatmap chart]

However, if we plan to keep Polars as the main dataframe library, the second workaround is better.

2. Parse the date column to the datetime data type instead of date.

Cast the str column to datetime[μs] before making the plot in Altair.

This works when try_parse_dates=False (the default in read_csv):

dfx = (
    pl.read_csv(
        "https://raw.githubusercontent.com/vega/vega-datasets/main/data/seattle-weather.csv")
    .with_columns(pl.col("date").str.to_datetime())
)

dfx.head()

[screenshot: dfx.head() with the date column as datetime[μs]]

The same charting code then works:

[screenshot: rendered heatmap chart]

jonmmease commented 11 months ago

Thanks for the report @ariewjy, and apologies for the slow response. I'll have to look more closely, but this looks like a limitation in pyarrow's implementation of the DataFrame interchange protocol. We should be able to work around it, but will take some experimentation to see where it makes sense to do this.

jonmmease commented 8 months ago

So I looked into this a bit more, and found that pyarrow doesn't support loading Date32 columns through the dataframe interchange protocol. https://github.com/apache/arrow/issues/39539. I posted a repro of this error in that thread and asked what the best path forward is.

We could work around this by introducing an optional dependency on polars, but we were hoping we could avoid this and focus only on supporting the dataframe interchange protocol. But I would very much like Altair to work smoothly with Polars, so depending on how likely this is to be resolved in pyarrow, we may need to do this in the short term.

cc @mattijn in case you have thoughts

mattijn commented 8 months ago

If the dataframe interchange protocol is undecided on how to support this, then we should solve it pragmatically. There is not always a royal road. I'm not in favor of introducing an optional polars dependency just to do an isinstance check on the Polars dataframe.

We can do a simple check on the annotated return type of the .__dataframe__ method, and if it is a PolarsDataFrame we know we can call the .to_arrow() method directly without going through the more complicated and slower dataframe interchange protocol.

# add new function
def dataframe_instance(obj):
    return obj.__dataframe__.__annotations__["return"]

# adaptations required in https://github.com/altair-viz/altair/blob/main/altair/utils/data.py 
if hasattr(data, "__dataframe__"):
    if dataframe_instance(data) == 'PolarsDataFrame':
        pa_table = data.to_arrow()
    else:
        pi = import_pyarrow_interchange()
        pa_table = pi.from_dataframe(data)
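A stdlib-only sketch of what the annotation lookup above relies on. The classes here are stand-ins, not the real polars types; the check against the string 'PolarsDataFrame' works because a quoted return annotation is stored as a plain string, and a bound method forwards __annotations__ to its underlying function:

```python
class PolarsDataFrame:  # stand-in for polars' interchange wrapper class
    pass

class DataFrame:  # stand-in for polars.DataFrame
    def __dataframe__(self) -> "PolarsDataFrame":
        return PolarsDataFrame()

def dataframe_instance(obj):
    # Bound methods delegate attribute access to the wrapped function,
    # so __annotations__ is reachable without calling the method.
    return obj.__dataframe__.__annotations__["return"]

print(dataframe_instance(DataFrame()))  # PolarsDataFrame
```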

By the way, for me this issue doesn't raise an error, and neither does the example you provided in your comment at the pyarrow repository:

import datetime

import pyarrow as pa
import polars as pl
import pyarrow.interchange as pi

print(pa.__version__, pl.__version__)
# 12.0.1 0.20.16

data = pl.DataFrame({"date": [datetime.date(2024, 3, 22)]})
pi.from_dataframe(data)
# pyarrow.Table
# date: int32
# ----
# date: [[19804]]

Whereas:

data.to_arrow()
# pyarrow.Table
# date: date32[day]
# ----
# date: [[2024-03-22]]

Even more reason not to use the dataframe interchange protocol here...

mattijn commented 8 months ago

Hm, the function I proposed will not always work (e.g. from this issue: https://github.com/vega/vegafusion/issues/386):

import pandas as pd
import vega_datasets

from pandas.core.interchange.dataframe import PandasDataFrameXchg

class NoisyDfInterface(PandasDataFrameXchg):
    def __dataframe__(self, allow_copy: bool = True):
        return NoisyDfInterface(self._df, allow_copy=allow_copy)

    def get_column_by_name(self, name):
        print(f"get_column_by_name('{name}')")
        return super().get_column_by_name(name)

cars = vega_datasets.data.cars()
dfy = NoisyDfInterface(cars)

type(dfy).__name__
# 'NoisyDfInterface'

But:

dataframe_instance(dfy)
# ----> return obj.__dataframe__.__annotations__["return"]
# KeyError: 'return'

Initially I thought we could use type(dfy).__name__, but this returns 'DataFrame' for both a pandas DataFrame and a Polars DataFrame.

mattijn commented 8 months ago

It is getting messy:

if hasattr(data, "__dataframe__"):
    if 'polars' in type(data).__module__:
        pa_table = data.to_arrow()

Open for other suggestions @jonmmease 😄

jonmmease commented 8 months ago

Haha, yeah, that would work. Another option is to lean into duck typing: just check for the existence of a data.to_arrow() method on the object and call that, instead of using the __dataframe__ interface.

Looks like cudf has the same method: https://docs.rapids.ai/api/cudf/legacy/user_guide/api_docs/api/cudf.dataframe.to_arrow/

I think vaex uses to_arrow_table(), and duckdb supports .arrow() and .to_arrow_table(). So maybe we check for the existence of methods named "arrow", "to_arrow", or "to_arrow_table" and call one of these if they exist.
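The duck-typing idea can be sketched in a few lines. The helper name and the dummy classes below are hypothetical stand-ins for illustration, not Altair's actual implementation:

```python
# Probe a dataframe-like object for a native Arrow conversion method
# and call the first one found, instead of relying on type checks.
ARROW_METHODS = ("arrow", "to_arrow", "to_arrow_table")

def native_arrow_table(data):
    """Call the object's own Arrow converter if it has one, else return None."""
    for name in ARROW_METHODS:
        method = getattr(data, name, None)
        if callable(method):
            return method()
    return None

class FakePolarsFrame:  # stands in for polars.DataFrame
    def to_arrow(self):
        return "pa.Table"  # a real object would return a pyarrow.Table

class FakeVaexFrame:  # stands in for a vaex dataframe
    def to_arrow_table(self):
        return "pa.Table"

print(native_arrow_table(FakePolarsFrame()))  # pa.Table
print(native_arrow_table(object()))           # None -> fall back to interchange
```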

mattijn commented 8 months ago

Would the following be a compromise to resolve this?

If the object has a __dataframe__ attribute, then:

we check for the existence of methods named "arrow", "to_arrow", or "to_arrow_table" and call one of these if they exist.

Otherwise, use the formal dataframe interchange protocol.
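Putting that compromise together, the control flow might look roughly like this. The function name is illustrative, not the merged code:

```python
def to_arrow_or_interchange(data):
    # Require the interchange dunder first (Altair also uses it for
    # encoding type inference) ...
    if not hasattr(data, "__dataframe__"):
        raise TypeError("expected a dataframe-like object")
    # ... then prefer a native Arrow converter if the object offers one.
    for name in ("arrow", "to_arrow", "to_arrow_table"):
        method = getattr(data, name, None)
        if callable(method):
            return method()
    # Otherwise fall back to the formal dataframe interchange protocol.
    import pyarrow.interchange as pi
    return pi.from_dataframe(data)
```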

jonmmease commented 8 months ago

Yeah, I think requiring the __dataframe__ attribute (which we use for encoding type inference), and then allowing the object to provide their own arrow conversion with "arrow", "to_arrow", or "to_arrow_table" makes good sense.

jonmmease commented 8 months ago

I updated https://github.com/altair-viz/altair/pull/3377 to include this logic, which now fixes this Date32 issue when using Polars and PyArrow.