Open wirhabenzeit opened 2 months ago
I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)
I checked the Vega-Lite output and I think the issue is the `sort` parameter of the resulting spec file.
What would you like to happen instead?
The same code with `pl.String` columns works as expected.
https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html
@wirhabenzeit you'll need to use `pl.Categorical("lexical")` for this behavior:
import altair as alt
import polars as pl
from vega_datasets import data
df = pl.DataFrame(data.cars()).with_columns(
pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
alt.Chart(df).mark_point().properties(width=150, height=150).encode(
x="Horsepower",
y="Miles_per_Gallon",
shape="Cylinders",
color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")
@dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in wrong facets. If you look at the example above, then the blue points all should be in the USA facet, irrespective of the ordering of the rows.
In fact when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic.
@wirhabenzeit Could you explain the difference between these two?
I'm more than happy to reopen the issue if I've misunderstood, but they look the same to me?
@dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not `lexical` results in data points appearing in wrong facets. Above I used the lexical ordering with string-columns only to show the bug. The output I would like is the output which respects the categorical order and does not put points in wrong facets.
Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue!
@mattijn I have looked around more and I think this goes back to https://github.com/vega/vega-lite/issues/5937
Basically there is a long-standing bug in Vega-Lite with facet-sorting whenever there are missing data points in some of the facets. I did not find it initially because I was focused on `pl.Categorical` and did not suspect it was a problem with the sorting.
I'm still unsure how this isn't explained by the nondeterministic ordering in `polars`, but reopened since @mattijn seems to get it
Might also be related to https://github.com/vega/vega-lite/issues/8675 which was reported in Altair here https://github.com/vega/altair/issues/3481.
Yeah, the referenced VL issues are relevant here.
But just to be complete, here is what is happening. Altair sorts the values of a column in ascending order when the column is of type `str` and is used on an encoding channel.
So when having this data:
import polars as pl
import altair as alt
df = pl.DataFrame({"value": [2, 5, 3], "month": ["jan", "feb", "mar"]})
And visualising it with the `month` on the `x`-axis channel and the values on the `color` channel using a `rect`-mark:
chart = alt.Chart(df).mark_rect().encode(
x='month',
color='value'
)
chart
It can be seen that the `x`-axis is ordered by `feb`, `jan`, `mar`, since the `f` comes before `j` in the alphabet.
So let's cast the `month` column in the dataframe to a `categorical` in order of appearance (the default in polars). We get the following:
df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg).mark_rect().encode(
x='month',
color='value'
)
chart_catg
The `x`-axis is now ordered by `jan`, `feb`, `mar`, following the order defined in the dataframe.
By comparing the Vega-Lite specification of both charts we notice that the `categorical` column is serialised differently.
Top chart, column is of type `str` and it becomes:
chart.to_dict()['encoding']['x']
{'field': 'month', 'type': 'nominal'}
With categorical column defined, it becomes:
chart_catg.to_dict()['encoding']['x']
{'field': 'month', 'sort': ['jan', 'feb', 'mar'], 'type': 'ordinal'}
The `sort` order is serialized from the categorical definition of the `month` column in the DataFrame.
All good so far!
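To make the two orderings concrete, here is a minimal plain-Python sketch. It does not call Altair or polars; it only mimics the ordering logic described above, under the assumption that alphabetical sorting stands in for the default ascending sort of a `str` column and first-appearance order stands in for the default polars categorical.

```python
months = ["jan", "feb", "mar"]

# A plain string column carries no order of its own, so the axis falls back
# to ascending (alphabetical) sorting.
nominal_domain = sorted(set(months))

# A categorical in order of appearance keeps the first-seen order, which is
# what ends up in the `sort` list of the spec.
ordinal_sort = list(dict.fromkeys(months))

print(nominal_domain)  # ['feb', 'jan', 'mar']
print(ordinal_sort)    # ['jan', 'feb', 'mar']
```

The second list matches the `sort` entry shown in the categorical encoding above.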
Sidenote: Observe that the `type` is also different, `ordinal` for the dataframe with the `month` column defined as categorical and `nominal` for the `month` column just defined as `str`. The effect of this is that when you use the categorical `month` column for the color encoding channel, it is treated as an ordered categorical and therefore gets a sequential color scheme, versus the default which provides distinct color values for `str` values:
alt.vconcat(
    chart.encode(color="month"),
    chart_catg.encode(color="month")
).resolve_scale(
    color="independent"
)
But upon adding encoding channels such as `row` and `column`, this logic for sorting categorical columns in the DataFrame breaks the rendering when there are combinations that contain no data.
The following goes well, but the order of the `column` may be seen as not right.
import polars as pl
import altair as alt
df = pl.DataFrame(
{
"time": [0, 1, 0, 1, 0, 1, 0, 1],
"value": [0, 5, 0, 5, 0, 5, 0, 5],
"choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
"month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
}
)
chart = alt.Chart(df, height=100, width=100).mark_line().encode(
x='time',
y='value',
color='choice',
row='choice',
column='month'
)
chart
So upon defining the column `month` as categorical, the order of the months in the `column` encoding is correctly sorted, but the data within the plots are incorrect.
df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg, height=100, width=100).mark_line().encode(
x='time',
y='value',
color='choice',
row='choice',
column='month'
)
chart_catg
Leading indeed to data being drawn within the wrong subplot! So, indeed, be very careful here!
One can use the following workaround when having a polars DataFrame as in OP:
df_complete = (
df.select(pl.col(["choice", "month"]).unique().implode())
.explode("choice")
.explode("month")
.join(df, how="left", on=["choice", "month"])
)
df_complete_sorted = df_complete.sort(pl.col("month").cast(pl.Enum(["jan", "feb", "mar"])))
df_complete_catg = df_complete_sorted.with_columns(pl.col("month").cast(pl.Categorical))
df_complete_catg
chart_complete_catg = alt.Chart(df_complete_catg, height=100, width=100).mark_line().encode(
x='time',
y='value',
color='choice',
row='choice',
column='month'
)
chart_complete_catg
This workaround basically makes sure that all combinations that can be made with the row/column channel encoding actually exist in the DataFrame, albeit filled with a `null` value.
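The idea behind the workaround can also be sketched without polars. The snippet below is only an illustration in plain Python using the same toy data; `complete_facets` is a hypothetical helper name, not part of Altair or polars. It appends one null-filled record for every row/column pair that has no data.

```python
from itertools import product

# Same toy data as above, as a list of records.
records = [
    {"time": t, "value": v, "choice": c, "month": m}
    for t, v, c, m in zip(
        [0, 1, 0, 1, 0, 1, 0, 1],
        [0, 5, 0, 5, 0, 5, 0, 5],
        ["A", "A", "B", "B", "A", "A", "B", "B"],
        ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    )
]


def complete_facets(records, row_key, col_key):
    """Append a null-filled record for every row/column pair without data."""
    row_values = list(dict.fromkeys(r[row_key] for r in records))
    col_values = list(dict.fromkeys(r[col_key] for r in records))
    present = {(r[row_key], r[col_key]) for r in records}
    other_keys = [k for k in records[0] if k not in (row_key, col_key)]
    filler = [
        {row_key: rv, col_key: cv, **{k: None for k in other_keys}}
        for rv, cv in product(row_values, col_values)
        if (rv, cv) not in present
    ]
    return records + filler


completed = complete_facets(records, "choice", "month")
added = [(r["choice"], r["month"]) for r in completed[len(records):]]
print(added)  # [('A', 'mar'), ('B', 'jan')]
```

For this data the two empty panels are `A`/`mar` and `B`/`jan`, which are exactly the pairs that get a placeholder record.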
@mattijn Thanks for investigating! As far as I can see the issue arises on a group level, e.g. when grouping by facet, but also specifying encodings such as color or shape, then data gets misplaced as soon as any group (like row a, column b, color c, shape d) contains no data points. Could there be an automatic way of detecting this on the Altair side, and issuing a warning? Probably that’s difficult in case categories are derived using transformations etc?
Thanks for your response. You mean you can introduce this behavior without a `row`/`column` encoding channel included? Do you have an example of this? That seems more troublesome and would indeed require more feedback to the user. A warning at best or at least a note in the documentation.
No, I think without rows/columns the issue is not there. What I meant is that problems arise as soon as some color/shape group has no data points in some facet. So for the workaround one would need to fill in nulls not only for empty facets but also for empty groups within a facet. In my original example above the two plots are not just different in the sense that some entire facets are in the wrong place, but the individual facets are also different. I can try to produce a more minimal example showing this.
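A rough sketch of what such a check could look like, in plain Python. This is not an existing Altair feature; `find_empty_groups` and `warn_on_empty_groups` are hypothetical names, and the rows are made up for illustration (they are not the real cars dataset). The check works on any list of keys, so it could cover facet channels as well as color or shape groups, assuming the category values are available up front rather than derived by transforms.

```python
import warnings
from itertools import product

# Toy rows made up for illustration (not the real cars dataset).
records = [
    {"Origin": "USA", "Cylinders": "8"},
    {"Origin": "USA", "Cylinders": "4"},
    {"Origin": "Japan", "Cylinders": "4"},
    {"Origin": "Japan", "Cylinders": "3"},
]


def find_empty_groups(records, keys):
    """Return every combination of the observed key values that has no rows."""
    levels = [list(dict.fromkeys(r[k] for r in records)) for k in keys]
    present = {tuple(r[k] for k in keys) for r in records}
    return [combo for combo in product(*levels) if combo not in present]


def warn_on_empty_groups(records, keys):
    """Hypothetical check: warn when any facet/group combination is empty."""
    missing = find_empty_groups(records, keys)
    if missing:
        warnings.warn(
            f"{len(missing)} combination(s) of {keys} have no data: {missing}",
            UserWarning,
            stacklevel=2,
        )
    return missing


# Capture the warning so the example runs quietly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = warn_on_empty_groups(records, ["Origin", "Cylinders"])

print(missing)      # [('USA', '3'), ('Japan', '8')]
print(len(caught))  # 1
```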
There seems to be something going on with polars too. First, if I do
import polars as pl
import vega_datasets
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
The order is correct in the chart, but when doing:
df['Cylinders'].cat.get_categories().to_list()
I get
['8', '4', '6', '3', '5']
So it is not really clear to me, how the chart specification can know the right order.
But if I try to force the categorical order using an Enum:
import vega_datasets
import polars as pl
df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
pl.col("Origin"),
pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list()
print('cast Enum', sorted(uniq_cylinders))
df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders)))) # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))
df_catg['Cylinders'].cat.get_categories().to_list()
It returns
cast Enum ['3', '4', '5', '6', '8']
['8', '4', '6', '3', '5']
And a wrongly sorted chart. @dangotbanned, do you know more about this behaviour of polars?
@mattijn I can help but could you add some comments - explaining the intention behind each action you've taken please?
I'm having trouble understanding as this reads more like `pandas` than `polars` code
My immediate thoughts are:
- `maintain_order=True` when using `unique` https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.unique.html#polars-expr-unique
I should have elaborated in https://github.com/vega/altair/issues/3588#issuecomment-2344444054 but to me the issue seems to be wanting some explicit behavior - without using any of the explicit features of `polars`.
So one way to look at this is: if you tell `polars` to do something, it will try to optimize for the fastest query to get there.
However if you have some constraint that hasn't been defined - then you may be surprised when that gets optimized out.
Maybe this section of their user guide would be helpful?
Also https://docs.pola.rs/user-guide/concepts/data-types/categoricals/
I notice one thing that is different.
If I define the dataframe as you suggested using the `lexical` option within `pl.Categorical()`, it is not persisted or included when compiling to Vega-Lite:
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)
chart = alt.Chart(df).mark_point().properties(width=100, height=100).encode(
x="Horsepower",
y="Miles_per_Gallon",
shape="Cylinders",
color="Origin",
).facet(row="Origin", column="Cylinders")
print(df.get_column("Cylinders").cat.get_categories())
print(chart.to_dict()['facet'])
shape: (5,)
Series: 'Cylinders' [str]
[
"8"
"4"
"6"
"3"
"5"
]
{'column': {'field': 'Cylinders', 'type': 'nominal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}
As you can see, there is no `sort` defined for the `column` encoding channel. Therefore the order is correct in this case, but as a false positive.
Whereas my ugly (no fun indeed!) DataFrame definition actually includes the `sort` for the `column` encoding channel.
{'column': {'field': 'Cylinders', 'sort': ['8', '4', '6', '3', '5'], 'type': 'ordinal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}
Long story short, how does the inference work for a polars column cast as a lexical categorical? Is it correct that there is no `sort` definition in the corresponding Vega-Lite specification?
Thanks @mattijn for the detail!
So this part can be answered (I think) with `narwhals.is_ordered_categorical` and `alt.utils.core.infer_vegalite_type_for_narwhals`:
From what I'm understanding of https://github.com/narwhals-dev/narwhals/blob/aed2d515a2e26465a6edecf8d7aa560353cbdfa2/narwhals/utils.py#L401-L407
The type will be
- `"ordinal"` for `pl.Categorical("physical")`, `pl.Enum`
- `"nominal"` for `pl.Categorical("lexical")`, `nw.Enum` (edited, previously `pl.Enum`), `pl.String`
cc @MarcoGorelli to double check
Misunderstood that `nw.Enum` != `pl.Enum`
`nw.Enum` can represent more than only `pl.Enum`
`narwhals.is_ordered_categorical`:
For Polars: Enums are always ordered. Categoricals are ordered if `dtype.ordering == "physical"`.
Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already.
Or can I understand from here that a categorical with `dtype.ordering == "lexical"` is intentionally not ordered? And therefore casting to `pl.Categorical('lexical')` is correctly not adding a `sort` argument to the Vega-Lite specification?
@mattijn no worries, yeah you've understood that correctly
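For readers following along, the rule as confirmed here can be condensed into a small plain-Python sketch. `infer_encoding` is a hypothetical simplification written for this thread, not the actual Altair or narwhals code; dtypes are passed as plain strings and the category list is the one printed earlier for `Cylinders`.

```python
def infer_encoding(field, dtype, ordering=None, categories=()):
    """Simplified stand-in for the ordinal/nominal inference discussed above."""
    # Enums are always ordered; Categoricals only with "physical" ordering.
    is_ordered = dtype == "Enum" or (
        dtype == "Categorical" and ordering == "physical"
    )
    if is_ordered:
        return {"field": field, "sort": list(categories), "type": "ordinal"}
    # Everything else (lexical Categorical, String) gets no `sort` entry.
    return {"field": field, "type": "nominal"}


cylinders = ["8", "4", "6", "3", "5"]  # order returned by get_categories()
physical = infer_encoding("Cylinders", "Categorical", "physical", cylinders)
lexical = infer_encoding("Cylinders", "Categorical", "lexical", cylinders)

print(physical)
print(lexical)
```

The two results have the same shape as the two `column` entries printed from `chart.to_dict()['facet']` above: one with a `sort` list and `ordinal`, one with `nominal` and no `sort`.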
Thanks for the ping!
> Misunderstood that nw.Enum != pl.Enum
I think they should be the same? As in, `pl.Enum` should be recognised as `nw.Enum`:
In [21]: nw.from_native(pl.Series(['a', 'b', 'c'], dtype=pl.Enum(['b', 'a', 'c', 'd'])), allow_series=True).dtype == nw.Enum
Out[21]: True
Regarding `physical` vs `lexical`, I don't think that `get_categories` reflects the order - but maybe it should? The difference can be seen if you compare the categories, e.g. in a sort:
In [22]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('lexical')).sort()
Out[22]:
shape: (3,)
Series: '' [cat]
[
"a"
"b"
"c"
]
In [23]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('physical')).sort()
Out[23]:
shape: (3,)
Series: '' [cat]
[
"b"
"a"
"c"
]
but they both return the same output for `.cat.get_categories()`. Do you need the output of `.cat.get_categories` to reflect the category ordering?
`nw.is_ordered_categorical` just does what Polars does in its dataframe interchange protocol definition:
if dtype == Categorical:
    categories = self._col.cat.get_categories()
    is_ordered = dtype.ordering == "physical"  # type: ignore[attr-defined]
elif dtype == Enum:
    categories = dtype.categories  # type: ignore[attr-defined]
    is_ordered = True
else:
    msg = "`describe_categorical` only works on categorical columns"
    raise TypeError(msg)
the interchange protocol definition is a bit vague here, it just says "whether the ordering of dictionary indices is semantically meaningful"
Thanks @MarcoGorelli
So I goofed on this one 🤦♂️
In https://github.com/vega/altair/issues/3588#issuecomment-2346275689 I was trying to explain this bit where `nw.Enum` is representing non-`polars` Enums.
AFAIK `pl.Enum` wouldn't reach that branch:
https://github.com/vega/altair/blob/a171ce8cb2f0b0cb0c944ddbd0c0623282570c0c/altair/utils/core.py#L712-L722
Maybe I should've written `nw.Enum` >= `pl.Enum` - or skipped the operators entirely
Regarding the original post, I can reproduce it with Altair 5.3.0, but not with Altair 5.4.1
I can still reproduce it in Altair 5.4.1 however, if I use `pl.Categorical('lexical')` instead of `pl.Categorical`
EDIT: this comment is outdated, please ignore - once I did `uv cache clean` and reinstalled, I could indeed reproduce the issue as-reported
I think the issue reproduces with pandas ordered categoricals too, both on Altair 5.3.0 and Altair 5.4.1
code:
import altair as alt
import pandas as pd
df_catg2 = pd.DataFrame(
{
"time": [0, 1, 0, 1, 0, 1, 0, 1],
"value": [0, 5, 0, 5, 0, 5, 0, 5],
"choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
"month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
}
)
df_catg2["month"] = df_catg2["month"].astype(pd.CategoricalDtype(ordered=True))
chart_catg2 = (
alt.Chart(df_catg2, height=100, width=100)
.mark_line()
.encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_catg2
Pff, complicated. Altair assumes that the returned categories are in sorted order when it is defined as ordered, but this is an assumption that does not always hold.
`get_categories()` returns the custom order as is defined. Result is that the custom order list from `get_categories()` is used to `sort` the `y`-encoding channel in the following chart.
import altair as alt
import narwhals as nw
import polars as pl

my_order = ["k", "z", "b", "a"]
df = pl.from_dict({"cats": ['z', 'z', 'k', 'a', 'b'], "vals": [3, 1, 2, 2, 3]})
df = df.with_columns(pl.col("cats").cast(pl.Enum(my_order)))
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title='pl.Enum(my_order)').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
> <img width="684" alt="image" src="https://github.com/user-attachments/assets/aeb40a82-eb72-44a3-acca-812905bfab98">
- Physical categorical is going OK in Altair.
The column is seen as an ordered categorical. `get_categories()` returns physical categorical sorted in physical order. Result is that the physical ordered list from `get_categories()` is used to `sort` the `y`-encoding channel in the following chart.
```python
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical()))  # 'physical'
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title='pl.Categorical()').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
```
`get_categories()` does not return lexical categorical sorted in lexical order. Result is that the list from `get_categories()` is not used for the `y`-encoding channel. Since there is no `sort` defined, it applies ascending sorting within Vega, making it look as if the lexical categorical has an impact.
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical('lexical')))
nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())
chart = alt.Chart(df, title="pl.Categorical('lexical')").mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
> <img width="684" alt="image" src="https://github.com/user-attachments/assets/f6579f4f-a8d7-4a11-96ac-1accb5115421">
To support lexical categorical, it should
1. Be considered as ordered by narwhals.
2. The sort order of the `get_categories()` should be reflecting the lexical order.
Current implementation of `nw.is_ordered_categorical` only allows order to be defined based on numeric values and not on alphabet (lexical).
Apparently the situation is different for pandas ordered categorical, since it does not always return the _sorted_ physical ordered categorical.
***
Btw. When trying OP as you did in https://github.com/vega/altair/issues/3588#issuecomment-2347221524. I get this:
```python
import polars as pl
import vega_datasets
import altair as alt

alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders").to_dict()['facet']
```
{'column': {'field': 'Cylinders', 'sort': ['8', '4', '6', '3', '5'], 'type': 'ordinal'}, 'row': {'field': 'Origin', 'sort': ['USA', 'Europe', 'Japan'], 'type': 'ordinal'}}
With `('5.5.0dev', '1.6.2')` for `alt.__version__, nw.__version__`. Reflecting the behaviour you have when using Altair version `5.3.0`...
> Reflecting the behaviour you have when using Altair version 5.3.0...
Right, sorry about that, I just did `uv cache clean`, reinstalled everything, and indeed I can reproduce the original post - I've marked my previous comment as outdated
> Physical categorical is going OK in Altair.
Are you sure about this? It seems to me that anything which is auto-inferred to be "ordinal" (as opposed to "nominal") is subject to issues
For example, if we start with
import altair as alt
import pandas as pd
import polars as pl
df_cat = pd.DataFrame(
{
"time": [0, 1, 0, 1, 0, 1, 0, 1],
"value": [0, 5, 0, 5, 0, 5, 0, 5],
"choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
"month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
}
)
then:
df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=True))
chart_cat = (
alt.Chart(df_cat, height=100, width=100)
.mark_line()
.encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=False))
chart_cat = (
alt.Chart(df_cat, height=100, width=100)
.mark_line()
.encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
df_cat = pl.from_pandas(df_cat).with_columns(
pl.col('month').cast(pl.Categorical('physical'))
)
chart_cat = (
alt.Chart(df_cat, height=100, width=100)
.mark_line()
.encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
df_cat = pl.from_pandas(df_cat).with_columns(
pl.col('month').cast(pl.Categorical('lexical'))
)
chart_cat = (
alt.Chart(df_cat, height=100, width=100)
.mark_line()
.encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat
> To support lexical categorical, it should
> - Be considered as ordered by narwhals.
> - The sort order of the get_categories() should be reflecting the lexical order.
I've tried doing this, but then the output from the example above becomes incorrect for both physical and lexical
Not sure about anything anymore, but I think we have identified at least four issues/anomalies by now:
1. When using a `row`/`column` encoding channel in combination with a `sort` parameter, it will place your data in incorrect subplots if some of the panels have no defined data. Related issue defined in VL: https://github.com/vega/vega-lite/issues/5937
2. A pandas ordered categorical does not return its categories in physical order:
pd.Series(['4', '2', '12'], dtype=pd.CategoricalDtype(ordered=True)).cat.categories.to_list()
['12', '2', '4'] # physical sorted is ['4', '2', '12']
3. A polars lexical categorical does not return its categories in lexical order:
pl.Series(['4', '2', '12']).cast(pl.Categorical('lexical')).cat.get_categories().to_list()
['4', '2', '12'] # lexical sorted is ['12', '2', '4']
4. A pandas categorical converted to a polars physical categorical keeps the category order it had in pandas (`pl_s1`):
s1 = pd.Series(['4', '2', '12'], dtype='category')
s2 = pd.Series(['4', '2', '12'])
pl_s1 = pl.from_pandas(s1).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s2 = pl.from_pandas(s2).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s1, pl_s2
> ```python
> (['12', '2', '4'], ['4', '2', '12'])
> ```
For clarity, data without a defined categorical returns its categories sorted in physical order when cast to a physical categorical in polars (`pl_s2`)
***
Regarding my comment, a few clarification notes in _[italic]_:
> To support _[inference of columns with its type casted as]_ lexical categorical, _[the column]_ should
>
> 1. Be considered as ordered by narwhals.
> 2. The sort order of the `get_categories()` _[of this column]_ should be reflecting the lexical order.
So basically, for proper dataframe inference of ordered categoricals then:
- A lexical categorical should return its categories sorted in lexical order.
- A physical categorical should return its categories sorted in physical order.
This also means that data will currently be placed in wrong subplots if there are panels without data, for both lexical and physical ordered categoricals, since there will be a `sort` defined for the `row` / `column` encodings, as described in point 1 of this comment.
Pinging @c-peters for additional context, as they may have the best understanding of `pl.Categorical`:
Forgive me if there is something I am misunderstanding, but it seems like all the issues reported here could stem from VegaLite not handling the `sort` keyword correctly when faceting into rows and columns as per https://github.com/vega/vega-lite/issues/5937. I think it is difficult to properly troubleshoot what is happening with the categorical field sorting in row and column facets until this is fixed in VegaLite.
Outside row and column faceting, all the scenarios with pd and pl categories work as expected as far as I can see (i.e. the order of the color scale match the order of the categories in each of these examples):
Identified as ordinal as expected:
Changing the categorical order changes the color scale order:
Identified as nominal as expected:
Identified as ordinal as expected:
As already pointed out above, it seems that identification of lexical categories as ordinal data is not yet supported since it is not indicated as categorical data by narwhals and thus we get back unsorted nominal data:
Which would be the same as if we used the pl physical data frame and explicitly encoded the data type as nominal:
I'm not too familiar with what happens in Altair / Narwhals, but indeed the call to `get_categories` does not return the categories in sorted order.
Would it be possible to call `.sort()` for the lexical ordered case: `pl.col("category_column").cat.get_categories().sort()`
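As a rough illustration of that suggestion, here is a plain-Python sketch of the selection logic only. `sort_list_for` is a hypothetical helper, not an Altair or narwhals function, and the built-in `sorted` stands in for calling `.sort()` on the categories; the input list is the `['4', '2', '12']` example from earlier in the thread.

```python
def sort_list_for(categories, ordering):
    """Pick the list that would go into the `sort` entry of the spec."""
    if ordering == "lexical":
        # Lexical ordering: sort the category strings alphabetically.
        return sorted(categories)
    # Physical ordering: keep the categories in the order they are stored.
    return list(categories)


categories = ["4", "2", "12"]  # as returned by get_categories()
print(sort_list_for(categories, "lexical"))   # ['12', '2', '4']
print(sort_list_for(categories, "physical"))  # ['4', '2', '12']
```

Note that, as discussed above, emitting a `sort` list for facets would still run into the Vega-Lite issue whenever some panels are empty.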
What happened?
Faceting by `pl.Categorical` columns results in wrong facets
Which version of Altair are you using?
5.4.1