Closed: ariewjy closed this issue 8 months ago.
Thanks for the report @ariewjy, and apologies for the slow response. I'll have to look more closely, but this looks like a limitation in pyarrow's implementation of the DataFrame interchange protocol. We should be able to work around it, but will take some experimentation to see where it makes sense to do this.
So I looked into this a bit more, and found that pyarrow doesn't support loading Date32 columns through the dataframe interchange protocol. https://github.com/apache/arrow/issues/39539. I posted a repro of this error in that thread and asked what the best path forward is.
We could work around this by introducing an optional dependency on polars, but we were hoping we could avoid this and focus only on supporting the dataframe interchange protocol. But I would very much like Altair to work smoothly with Polars, so depending on how likely this is to be resolved in pyarrow, we may need to do this in the short term.
cc @mattijn in case you have thoughts
If the dataframe interchange protocol is indecisive about how to support this, then we should solve it pragmatically. There is not always a royal road. I'm not in favor of introducing an optional polars dependency just to do an `isinstance` check on the polars dataframe.

We can do a simple check on the return type annotation of the instance's `.__dataframe__` method: if it is a `PolarsDataFrame`, we know we can call the `.to_arrow()` function directly without entering the more complicated and slower dataframe interchange protocol.
```python
# add new function
def dataframe_instance(obj):
    return obj.__dataframe__.__annotations__["return"]


# adaptations required in https://github.com/altair-viz/altair/blob/main/altair/utils/data.py
if hasattr(data, "__dataframe__"):
    if dataframe_instance(data) == 'PolarsDataFrame':
        pa_table = data.to_arrow()
    else:
        pi = import_pyarrow_interchange()
        pa_table = pi.from_dataframe(data)
```
By the way, for me this issue does not raise an error, and neither does the example you provided in your comment on the pyarrow repository:
```python
import datetime

import pyarrow as pa
import polars as pl
import pyarrow.interchange as pi

print(pa.__version__, pl.__version__)

data = pl.DataFrame({"date": [datetime.date(2024, 3, 22)]})
pi.from_dataframe(data)
```

```
12.0.1 0.20.16
pyarrow.Table
date: int32
----
date: [[19804]]
```
Where `data.to_arrow()` gives:

```
pyarrow.Table
date: date32[day]
----
date: [[2024-03-22]]
```
Even more reason not to use the dataframe interchange protocol here.
Hm, the function I proposed will not always work (e.g. from this issue: https://github.com/vega/vegafusion/issues/386):
```python
import pandas as pd
import vega_datasets
from pandas.core.interchange.dataframe import PandasDataFrameXchg


class NoisyDfInterface(PandasDataFrameXchg):
    def __dataframe__(self, allow_copy: bool = True):
        return NoisyDfInterface(self._df, allow_copy=allow_copy)

    def get_column_by_name(self, name):
        print(f"get_column_by_name('{name}')")
        return super().get_column_by_name(name)


cars = vega_datasets.data.cars()
dfy = NoisyDfInterface(cars)
type(dfy).__name__
```

```
'NoisyDfInterface'
```
But:

```python
dataframe_instance(dfy)
```

```
----> [7] return obj.__dataframe__.__annotations__["return"]
KeyError: 'return'
```
Initially I thought we could use `type(dfy).__name__`, but for both a pandas DataFrame and a polars DataFrame this returns `'DataFrame'`.
It is getting messy:

```python
if hasattr(data, "__dataframe__"):
    if 'polars' in type(data).__module__:
        pa_table = data.to_arrow()
```
Open for other suggestions @jonmmease 😄
Haha, yeah, that would work. Another option is to lean into duck typing and just check for the existence of a `data.to_arrow()` method on the object and call that instead of using the `__dataframe__` interface.

Looks like cudf has the same method: https://docs.rapids.ai/api/cudf/legacy/user_guide/api_docs/api/cudf.dataframe.to_arrow/

I think vaex uses `to_arrow_table()`, and duckdb supports `.arrow()` and `.to_arrow_table()`. So maybe we check for the existence of methods named `"arrow"`, `"to_arrow"`, or `"to_arrow_table"` and call one of these if they exist.
Would the following be a compromise to resolve this?

- If it has a `__dataframe__` attribute, then check for the existence of methods named `"arrow"`, `"to_arrow"`, or `"to_arrow_table"` and call one of these if they exist.
- Otherwise, use the formal dataframe interchange protocol.
Yeah, I think requiring the `__dataframe__` attribute (which we use for encoding type inference), and then allowing the object to provide its own arrow conversion with `"arrow"`, `"to_arrow"`, or `"to_arrow_table"`, makes good sense.
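The fallback described above could be sketched roughly as follows (the function name here is hypothetical; the actual implementation is in the Altair codebase):

```python
def arrow_table_from_dataframe_like(data):
    """Convert an object implementing __dataframe__ to an Arrow table,
    preferring the library's own Arrow conversion method when one exists.

    Sketch only: the function name and exact structure are illustrative,
    not Altair's actual implementation.
    """
    if not hasattr(data, "__dataframe__"):
        raise TypeError("expected an object implementing __dataframe__")
    # Duck typing: let the library do its own (lossless) Arrow conversion,
    # which avoids interchange-protocol gaps like the Date32 issue.
    for method_name in ("arrow", "to_arrow", "to_arrow_table"):
        method = getattr(data, method_name, None)
        if callable(method):
            return method()
    # Fall back to the formal dataframe interchange protocol.
    import pyarrow.interchange as pi

    return pi.from_dataframe(data)
```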
I updated https://github.com/altair-viz/altair/pull/3377 to include this logic, which now fixes this Date32 issue when using Polars and PyArrow.
Issue: an altair (version 5.2.0) chart would not render if any of the columns is of the "date" temporal data type.
I found that whenever there is a `date` data type column in the polars dataframe, altair would not render the chart properly. The chart renders only when there is no `date` column, either by removing it completely, or by casting it to pandas or a different data type as below.

Code to reproduce issue

As shown above, with `try_parse_dates=True` the date column will be in `date` format. Using altair version 5.2.0, if plotted using `rect` encoding, the output will not be rendered.

There are two workarounds to avoid this issue:
1. Convert to pandas dataframe using `to_pandas()` before using altair.

This means altair will use the familiar format (a pandas dataframe) instead of polars; however, that means we wouldn't get the benefits of using a polars dataframe later on (e.g. speed). This perhaps works best if we are not planning to do any data wrangling with polars afterwards.

Using the same code to make a chart with altair would then work.
However, if we plan to use polars as the main dataframe library, then the second method is better.
2. Parse the date column to `datetime` type instead of `date` data type.

Cast the `str` type to `datetime[μs]` before making the plot in altair. The same chart plotting code using altair would then work.