vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Faceting bug for categorical columns #3588

Open wirhabenzeit opened 2 months ago

wirhabenzeit commented 2 months ago

What happened?

Faceting by pl.Categorical columns results in wrong facets

alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")

facets-not-ok

I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)

I checked the Vega-Lite output and I think the issue is the sort parameter of the resulting spec file.

What would you like to happen instead?

The same code with pl.String columns works as expected:

alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin"), 
        pl.col("Cylinders").cast(pl.String)
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")

facets-ok

Which version of Altair are you using?

5.4.1

dangotbanned commented 2 months ago

> I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)
>
> I checked the Vega-Lite output and I think the issue is the sort parameter of the resulting spec file.
>
> What would you like to happen instead?
>
> The same code with pl.String columns works as expected

https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html

@wirhabenzeit you'll need to use pl.Categorical("lexical") for this behavior:

import altair as alt
import polars as pl
from vega_datasets import data

df = pl.DataFrame(data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

alt.Chart(df).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")

Output: ![image](https://github.com/user-attachments/assets/8d352fa5-c636-4614-82bc-eae6cf0d55ef)

wirhabenzeit commented 2 months ago

@dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in wrong facets. If you look at the example above, then the blue points all should be in the USA facet, irrespective of the ordering of the rows.

In fact when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic.

dangotbanned commented 2 months ago

> @dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in wrong facets. If you look at the example above, then the blue points all should be in the USA facet, irrespective of the ordering of the rows.
>
> In fact when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic.

@wirhabenzeit Could you explain the difference between these two?

I'm more than happy to reopen the issue if I've misunderstood, but they look the same to me?

What you would like to happen

facets-ok

Output in https://github.com/vega/altair/issues/3588#issuecomment-2344322657

image

wirhabenzeit commented 2 months ago

@dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not lexical results in data points appearing in wrong facets. Above I used the lexical ordering with string-columns only to show the bug. The output I would like is the output which respects the categorical order and does not put points in wrong facets.

mattijn commented 2 months ago

Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue!

wirhabenzeit commented 2 months ago

@mattijn I have looked around more and I think this goes back to https://github.com/vega/vega-lite/issues/5937 Basically there is a long-standing bug in Vega-Lite with facet-sorting whenever there are missing data points in some of the facets. I did not find it initially because I was focused on pl.Categorical and did not suspect it was a problem with the sorting.

dangotbanned commented 2 months ago

@wirhabenzeit

> @dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not lexical results in data points appearing in wrong facets. Above I used the lexical ordering with string-columns only to show the bug. The output I would like is the output which respects the categorical order and does not put points in wrong facets.

@mattijn

> Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue!

I'm still unsure how this isn't explained by the nondeterministic ordering in polars, but reopened since @mattijn seems to get it

joelostblom commented 2 months ago

Might also be related to https://github.com/vega/vega-lite/issues/8675 which was reported in Altair here https://github.com/vega/altair/issues/3481.

mattijn commented 2 months ago

Yeah, the referenced VL issues are relevant here.

But just to be complete, here is what is happening. Altair tries to sort the values of a column in ascending order when that column is of type str and used on an encoding channel.

So when having this data:

import polars as pl
import altair as alt

df = pl.DataFrame({"value": [2, 5, 3], "month": ["jan", "feb", "mar"]})

And visualising it with the month on the x-axis channel and the values on the color channel using a rect-mark

chart = alt.Chart(df).mark_rect().encode(
    x='month',
    color='value'
)
chart
image

It can be seen that the x-axis is ordered by feb, jan, mar, since f comes before j in the alphabet.

So by casting the month column in the dataframe to a categorical in order of appearance (the polars default), we get the following:

df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg).mark_rect().encode(
    x='month',
    color='value'
)
chart_catg
image

The x-axis is now ordered by jan, feb, mar, like the order as is defined in the dataframe.

By comparing the Vega-Lite specification of both charts we notice that the categorical column is serialised differently.

Top chart, column is of type str and it becomes:

chart.to_dict()['encoding']['x']
{'field': 'month', 'type': 'nominal'}

With categorical column defined, it becomes:

chart_catg.to_dict()['encoding']['x']
{'field': 'month', 'sort': ['jan', 'feb', 'mar'], 'type': 'ordinal'}

The sort order is serialized from the categorical definition of the month column in the DataFrame. All good so far!


Sidenote: Observe that the type is also different: ordinal for the dataframe with the month column defined as categorical, and nominal for the month column defined as str. The effect is that when you use the categorical month column for the color encoding channel, it is treated as an ordered categorical and therefore gets a sequential color scheme, whereas the default provides distinct color values for str values:

alt.vconcat(
    chart.encode(color="month"), 
    chart_catg.encode(color="month")
).resolve_scale(
    color="independent"
)
image

But upon adding encoding channels such as row and column, this logic for sorting categorical columns in the DataFrame breaks the rendering when there are combinations that contain no data.

The following goes well, but the order of the columns may be seen as not right.

import polars as pl
import altair as alt

df = pl.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

chart = alt.Chart(df, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart
image

So upon defining the column month as categorical, the months in the column encoding are correctly sorted, but the data within the plots is incorrect.

df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart_catg
image

Leading indeed to data being drawn within the wrong subplot! So, indeed, be very careful here!

One can use the following workaround when having a polars DataFrame as in OP:

df_complete = (
    df.select(pl.col(["choice", "month"]).unique().implode())
    .explode("choice")
    .explode("month")
    .join(df, how="left", on=["choice", "month"])
)

df_complete_sorted = df_complete.sort(pl.col("month").cast(pl.Enum(["jan", "feb", "mar"])))
df_complete_catg = df_complete_sorted.with_columns(pl.col("month").cast(pl.Categorical))
df_complete_catg
image
chart_complete_catg = alt.Chart(df_complete_catg, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart_complete_catg
image

This workaround basically makes sure that every combination that is possible with the row/column encoding channels actually exists in the DataFrame, albeit filled with null values.
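
The idea behind the workaround can be sketched without any dataframe library. This is a minimal illustration, assuming the data is a list of dicts; the helper name `complete_grid` is hypothetical and not part of Altair or polars. It adds one placeholder row, with `None` in every non-key field, for each key combination that has no data.

```python
from itertools import product


def complete_grid(records, keys):
    """Return records plus one placeholder row per missing key combination.

    Hypothetical helper: placeholder rows carry None for every non-key
    field, mirroring a left join against the full cross product of keys.
    """
    # Distinct values per key, in order of first appearance.
    levels = {k: list(dict.fromkeys(r[k] for r in records)) for k in keys}
    present = {tuple(r[k] for k in keys) for r in records}
    other_fields = [f for f in records[0] if f not in keys]

    completed = list(records)
    for combo in product(*(levels[k] for k in keys)):
        if combo not in present:
            row = dict(zip(keys, combo))
            row.update({f: None for f in other_fields})
            completed.append(row)
    return completed


# Same values as the example DataFrame above.
columns = {
    "time": [0, 1, 0, 1, 0, 1, 0, 1],
    "value": [0, 5, 0, 5, 0, 5, 0, 5],
    "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
}
records = [dict(zip(columns, values)) for values in zip(*columns.values())]

completed = complete_grid(records, ["choice", "month"])
for row in completed[len(records):]:
    print(row)  # the placeholder rows for (A, mar) and (B, jan)
```

Passing more keys (for example a color or shape field) extends the same cross product to groups within a facet.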

wirhabenzeit commented 2 months ago

@mattijn Thanks for investigating! As far as I can see the issue arises on a group level, e.g. when grouping by facet, but also specifying encodings such as color or shape, then data gets misplaced as soon as any group (like row a, column b, color c, shape d) contains no data points. Could there be an automatic way of detecting this on the Altair side, and issuing a warning? Probably that’s difficult in case categories are derived using transformations etc?
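
A rough sketch of what such a check could look like for data that is available up front (it would not cover categories derived by transformations). This is not an existing Altair feature; `warn_on_empty_groups` is a hypothetical name and the rows are toy data for illustration only.

```python
import warnings
from itertools import product


def warn_on_empty_groups(records, fields):
    """Warn about, and return, combinations of `fields` that hold no rows."""
    # Distinct values per field, in order of first appearance.
    levels = [list(dict.fromkeys(r[f] for r in records)) for f in fields]
    present = {tuple(r[f] for f in fields) for r in records}
    missing = [c for c in product(*levels) if c not in present]
    if missing:
        warnings.warn(
            f"{len(missing)} empty group(s) for fields {fields}: {missing}",
            UserWarning,
            stacklevel=2,
        )
    return missing


# Toy rows: every "month" facet has data, but the "choice" group "B"
# is absent from the "jan" facet.
rows = [
    {"month": "jan", "choice": "A"},
    {"month": "feb", "choice": "A"},
    {"month": "feb", "choice": "B"},
]
missing = warn_on_empty_groups(rows, ["month", "choice"])
print(missing)
```

The fields passed in would be the facet fields plus any field used for color or shape, since the problem is described at the group level.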

mattijn commented 2 months ago

Thanks for your response. You mean you can introduce this behavior without a row/column encoding channel included? Do you have an example of this? That seems more troublesome and would indeed require more feedback to the user: a warning at best, or at least a note in the documentation.

wirhabenzeit commented 2 months ago

No, I think without rows/columns the issue is not there. What I meant is that problems arise as soon as some color/shape group has no data points in some facet. So for the workaround one would need to fill in nulls not only for empty facets but also for empty groups within a facet. In my original example above the two plots are not just different in the sense that some entire facets are in the wrong place, but the individual facets are also different. I can try to produce a more minimal example showing this.

mattijn commented 2 months ago

There seems to be something going on with polars too. First, if I do

import polars as pl
import vega_datasets
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

The order is correct in the chart, but when doing:

df['Cylinders'].cat.get_categories().to_list() 

I get

['8', '4', '6', '3', '5']

So it is not really clear to me, how the chart specification can know the right order.

But if I try to force the categorical order using an Enum:

import vega_datasets
import polars as pl

df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list() 
print('cast Enum', sorted(uniq_cylinders))

df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))

df_catg['Cylinders'].cat.get_categories().to_list()

It returns

cast Enum ['3', '4', '5', '6', '8']
['8', '4', '6', '3', '5']

And a wrongly sorted chart. @dangotbanned, do you know more about this behaviour of polars?

dangotbanned commented 2 months ago

> And a wrongly sorted chart. @dangotbanned, do you know more about this behaviour of polars?

@mattijn I can help but could you add some comments - explaining the intention behind each action you've taken please?

> But if I try to force the categorical order using an Enum:

Code block
import vega_datasets
import polars as pl

df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list() 
print('cast Enum', sorted(uniq_cylinders))

df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))

df_catg['Cylinders'].cat.get_categories().to_list()

I'm having trouble understanding as this reads more like pandas than polars code

My immediate thoughts are:

I should have elaborated in https://github.com/vega/altair/issues/3588#issuecomment-2344444054 but to me the issue seems to be wanting some explicit behavior - without using any of the explicit features of polars.

So one way to look at this, is if you tell polars to do something it will try to optimize for the fastest query to get there. However if you have some constraint that hasn't been defined - then you may be surprised when that gets optimized out.

Maybe this section of their user guide would be helpful?

Also https://docs.pola.rs/user-guide/concepts/data-types/categoricals/

mattijn commented 2 months ago

I notice one thing that is different.

If I define the dataframe as you suggested using a lexical option within the pl.Categorical() it is not persisted or included when compiling to Vega-Lite:

df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

chart = alt.Chart(df).mark_point().properties(width=100, height=100).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color="Origin",
).facet(row="Origin", column="Cylinders")

print(df.get_column("Cylinders").cat.get_categories())
print(chart.to_dict()['facet'])
shape: (5,)
Series: 'Cylinders' [str]
[
    "8"
    "4"
    "6"
    "3"
    "5"
]
{'column': {'field': 'Cylinders', 'type': 'nominal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}

As you can see, there is no sort defined for the column encoding channel. Therefore the order is correct in this case, but as a false positive.

Where in my ugly (no fun indeed!) defined DataFrame it actually includes the sort for the column encoding channel.

{'column': {'field': 'Cylinders', 'sort': ['8', '4', '6', '3', '5'], 'type': 'ordinal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}

Long story short, how does the inference work for a polars column cast as a lexical categorical? Is it correct that there is no sort definition in the corresponding Vega-Lite specification?

dangotbanned commented 2 months ago

> Long story short, how does the inference works of a polars column casted as a lexical categorical? Is it correct that there is no sort definition in the corresponding Vega-Lite specification?

Thanks @mattijn for the detail!

So this part can be answered (I think) with narwhals.is_ordered_categorical and alt.utils.core.infer_vegalite_type_for_narwhals:

https://github.com/vega/altair/blob/a171ce8cb2f0b0cb0c944ddbd0c0623282570c0c/altair/utils/core.py#L712-L729

From what I'm understanding of https://github.com/narwhals-dev/narwhals/blob/aed2d515a2e26465a6edecf8d7aa560353cbdfa2/narwhals/utils.py#L401-L407

The type will be

cc @MarcoGorelli to double check

Edit

Misunderstood that nw.Enum != pl.Enum nw.Enum can represent more than only pl.Enum

narwhals.is_ordered_categorical

For Polars: Enums are always ordered. Categoricals are ordered if dtype.ordering == "physical".
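
Put together, the rule above explains the two specs seen earlier. The following is a simplified stand-in for the inference, not Altair's or narwhals' actual code; the function name `infer_encoding` and its parameters are hypothetical.

```python
def infer_encoding(field, kind, ordering=None, categories=()):
    """Simplified sketch of the inference described above.

    kind: "enum", "categorical", or "string"
    ordering: "physical" or "lexical" (only meaningful for "categorical")
    """
    # Enums are always ordered; categoricals only when ordering is physical.
    is_ordered = kind == "enum" or (kind == "categorical" and ordering == "physical")
    if is_ordered:
        # Ordered: the categories are written out as an explicit sort.
        return {"field": field, "type": "ordinal", "sort": list(categories)}
    # Not ordered: no sort is emitted at all.
    return {"field": field, "type": "nominal"}


cats = ["8", "4", "6", "3", "5"]
print(infer_encoding("Cylinders", "categorical", "physical", cats))
print(infer_encoding("Cylinders", "categorical", "lexical", cats))
```

Under this reading, a lexical categorical never gets a `sort`, which is why its facets come out in the renderer's default order.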

mattijn commented 2 months ago

Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already. Or can I understand from here that a categorical with dtype.ordering == "lexical" is intentionally not ordered? And therefore casting to pl.Categorical('lexical') is correctly not adding a sort argument to the Vega-Lite specification?

dangotbanned commented 2 months ago

> Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already. Or can I understand from here that a categorical with dtype.ordering == "lexical" is intentionally not ordered? And therefore casting to pl.Categorical('lexical') is correctly not adding a sort argument to the Vega-Lite specification?

@mattijn no worries, yeah you've understood that correctly

MarcoGorelli commented 2 months ago

Thanks for the ping!

> Misunderstood that nw.Enum != pl.Enum

I think they should be the same? As in, pl.Enum should be recognised as nw.Enum:

In [21]: nw.from_native(pl.Series(['a', 'b', 'c'], dtype=pl.Enum(['b', 'a', 'c', 'd'])), allow_series=True).dtype == nw.Enum
Out[21]: True

Regarding physical vs lexical, I don't think that get_categories reflects the order - but maybe it should? The difference can be seen if you compare the categories, e.g. in a sort:

In [22]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('lexical')).sort()
Out[22]:
shape: (3,)
Series: '' [cat]
[
        "a"
        "b"
        "c"
]

In [23]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('physical')).sort()
Out[23]:
shape: (3,)
Series: '' [cat]
[
        "b"
        "a"
        "c"
]

but they both return the same output for .cat.get_categories(). Do you need the output of .cat.get_categories to reflect the category ordering?
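
The two orderings can be illustrated in plain Python. This sketch assumes a freshly created categorical whose physical order is the order of first appearance (the thread notes elsewhere that a pre-existing cache can change this); it does not call Polars at all.

```python
values = ["4", "2", "12", "2", "4"]

# Physical order: categories in order of first appearance.
physical = list(dict.fromkeys(values))
print(physical)  # ['4', '2', '12']

# Lexical order: categories compared as strings, so "12" sorts before "2".
lexical = sorted(physical)
print(lexical)  # ['12', '2', '4']

# Neither is numeric order, which needs an explicit key.
numeric = sorted(physical, key=int)
print(numeric)  # ['2', '4', '12']
```

If the category list is returned in physical order regardless of the dtype's ordering, a consumer that wants the lexical order has to sort that list itself.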


nw.is_ordered_categorical just does what Polars does in its dataframe interchange protocol definition:

https://github.com/pola-rs/polars/blob/501988ea1c2a114e4c28619727157354211af93a/py-polars/polars/interchange/column.py#L60-L78

        if dtype == Categorical:
            categories = self._col.cat.get_categories()
            is_ordered = dtype.ordering == "physical"  # type: ignore[attr-defined]
        elif dtype == Enum:
            categories = dtype.categories  # type: ignore[attr-defined]
            is_ordered = True
        else:
            msg = "`describe_categorical` only works on categorical columns"
            raise TypeError(msg)

the interchange protocol definition is a bit vague here, it just says "whether the ordering of dictionary indices is semantically meaningful"

dangotbanned commented 2 months ago

Thanks @MarcoGorelli

> Thanks for the ping!
>
> > Misunderstood that nw.Enum != pl.Enum
>
> I think they should be the same? As in, pl.Enum should be recognised as nw.Enum:
>
> In [21]: nw.from_native(pl.Series(['a', 'b', 'c'], dtype=pl.Enum(['b', 'a', 'c', 'd'])), allow_series=True).dtype == nw.Enum
> Out[21]: True

So I goofed on this one 🤦‍♂️

In https://github.com/vega/altair/issues/3588#issuecomment-2346275689 I was trying to explain this bit where nw.Enum is representing non-polars Enums.

AFAIK pl.Enum wouldn't reach that branch: https://github.com/vega/altair/blob/a171ce8cb2f0b0cb0c944ddbd0c0623282570c0c/altair/utils/core.py#L712-L722

Maybe I should've written nw.Enum >= pl.Enum - or skipped the operators entirely

MarcoGorelli commented 2 months ago

Regarding the original post, I can reproduce it with Altair 5.3.0, but not with Altair 5.4.1

image

I can still reproduce it in Altair 5.4.1 however, if I use pl.Categorical('lexical') instead of pl.Categorical

EDIT: this comment is outdated, please ignore - once I did uv cache clean and reinstalled, I could indeed reproduce the issue as-reported

MarcoGorelli commented 2 months ago

I think the issue reproduces with pandas ordered categoricals too, both on Altair 5.3.0 and Altair 5.4.1

image

code:

import altair as alt
import pandas as pd

df_catg2 = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

df_catg2["month"] = df_catg2["month"].astype(pd.CategoricalDtype(ordered=True))
chart_catg2 = (
    alt.Chart(df_catg2, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_catg2

mattijn commented 2 months ago

Pff, complicated. Altair assumes that the returned categories are in sorted order when it is defined as ordered, but this is an assumption that does not always hold.

import altair as alt
import narwhals as nw
import polars as pl

# Assumption: the definition of my_order was lost from the original comment;
# any list covering all values of "cats" works here.
my_order = ['k', 'z', 'b', 'a']

df = pl.from_dict({"cats": ['z', 'z', 'k', 'a', 'b'], "vals": [3, 1, 2, 2, 3]})
df = df.with_columns(pl.col("cats").cast(pl.Enum(my_order)))

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title='pl.Enum(my_order)').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart

> <img width="684" alt="image" src="https://github.com/user-attachments/assets/aeb40a82-eb72-44a3-acca-812905bfab98">

- Physical categorical is going OK in Altair.
  The column is seen as an ordered categorical. `get_categories()` returns the categories in physical order. As a result, the physically ordered list from `get_categories()` is used to `sort` the `y`-encoding channel in the following chart.

```python
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical()))  # 'physical'

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title='pl.Categorical()').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
```

image

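# The dataframe definition for this case was lost from the original comment;
# presumably the same data cast to a lexical categorical, as the chart title suggests.
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical('lexical')))

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title="pl.Categorical('lexical')").mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart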


> <img width="684" alt="image" src="https://github.com/user-attachments/assets/f6579f4f-a8d7-4a11-96ac-1accb5115421">

To support lexical categorical, it should 
1. Be considered as ordered by narwhals. 
2. The sort order of the `get_categories()` should be reflecting the lexical order.

Current implementation of `nw.is_ordered_categorical` only allows order to be defined based on numeric values and not on alphabet (lexical).

Apparently the situation is different for a pandas ordered categorical, since it does not always return its categories in physical order.
***

Btw. When trying OP as you did in https://github.com/vega/altair/issues/3588#issuecomment-2347221524. I get this:

```python
import polars as pl
import vega_datasets
import altair as alt
alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders").to_dict()['facet']
```

{'column': {'field': 'Cylinders',
  'sort': ['8', '4', '6', '3', '5'],
  'type': 'ordinal'},
 'row': {'field': 'Origin',
  'sort': ['USA', 'Europe', 'Japan'],
  'type': 'ordinal'}}

With ('5.5.0dev', '1.6.2') for alt.__version__, nw.__version__. Reflecting the behaviour you have when using Altair version 5.3.0...
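
One way to check whether a given chart is exposed to the facet-sorting problem is to look for a `sort` on its row/column channels in the compiled spec. A minimal sketch operating on the dict returned by `chart.to_dict()`; the helper name `sorted_facet_channels` is hypothetical and not part of Altair.

```python
def sorted_facet_channels(spec):
    """Return the row/column/facet channels of a spec dict that carry a sort."""
    channels = {}
    # Charts built with .facet(...) keep the channels under "facet";
    # charts using row=/column= encodings keep them under "encoding".
    for section in ("facet", "encoding"):
        for name, definition in spec.get(section, {}).items():
            if name not in ("row", "column", "facet"):
                continue
            if isinstance(definition, dict) and "sort" in definition:
                channels[name] = definition["sort"]
    return channels


# The facet definition printed above, wrapped as it appears in a full spec.
spec = {
    "facet": {
        "column": {"field": "Cylinders", "sort": ["8", "4", "6", "3", "5"], "type": "ordinal"},
        "row": {"field": "Origin", "sort": ["USA", "Europe", "Japan"], "type": "ordinal"},
    }
}
print(sorted_facet_channels(spec))
```

An empty result means no explicit sort reaches the facet channels, as with the lexical categorical spec shown earlier in the thread.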

MarcoGorelli commented 2 months ago

> Reflecting the behaviour you have when using Altair version 5.3.0...

Right, sorry about that, I just did uv cache clean, reinstalled everything, and indeed I can reproduce the original post - I've marked my previous comment as outdated

> Physical categorical is going OK in Altair.

Are you sure about this? It seems to me that anything which is auto-inferred to be "ordinal" (as opposed to "nominal") is subject to issues

For example, if we start with

import altair as alt
import pandas as pd
import polars as pl

df_cat = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

then:

pandas ordered categorical: 'ordinal', incorrect data

df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=True))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

pandas unordered categorical: 'nominal', correct data (but wrong ordering)

df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=False))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

Polars physical categorical: 'ordinal', incorrect data

df_cat = pl.from_pandas(df_cat).with_columns(
    pl.col('month').cast(pl.Categorical('physical'))
)
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

Polars lexical categorical: 'nominal', correct data (but wrong ordering)

df_cat = pl.from_pandas(df_cat).with_columns(
    pl.col('month').cast(pl.Categorical('lexical'))
)
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

> To support lexical categorical, it should
>
>   1. Be considered as ordered by narwhals.
>   2. The sort order of the get_categories() should be reflecting the lexical order.

I've tried doing this, but then the output from the example above becomes incorrect for both physical and lexical

mattijn commented 2 months ago

Not sure about anything anymore, but I think we have identified at least four issues/anomalies by now:

  1. When using a row/column encoding channel in combination with a sort parameter, your data is placed in incorrect subplots if some of the panels have no data. The related issue in VL is https://github.com/vega/vega-lite/issues/5937
  2. A pandas ordered categorical returns its categories lexical sorted, not physical sorted
    pd.Series(['4', '2', '12'], dtype=pd.CategoricalDtype(ordered=True)).cat.categories.to_list()
    ['12', '2', '4']  # physical sorted is ['4', '2', '12']
  3. A polars lexical categorical returns its categories physical sorted, not lexical sorted
    pl.Series(['4', '2', '12']).cast(pl.Categorical('lexical')).cat.get_categories().to_list()
    ['4', '2', '12']  # lexical sorted is ['12', '2', '4']
  4. A pre-cached lexical sorted categorical remains lexical sorted upon casting to physical categorical in polars (pl_s1)
    
    s1 = pd.Series(['4', '2', '12'], dtype='category')
    s2 = pd.Series(['4', '2', '12'])
    pl_s1 = pl.from_pandas(s1).cast(pl.Categorical('physical')).cat.get_categories().to_list()
    pl_s2 = pl.from_pandas(s2).cast(pl.Categorical('physical')).cat.get_categories().to_list()
    pl_s1, pl_s2
    (['12', '2', '4'], ['4', '2', '12'])

For clarity, data without a defined categorical returns its categories sorted in physical order when cast to a physical categorical in polars (`pl_s2`)

***

Regarding my comment, a few clarification notes in _[italic]_:

> To support _[inference of columns with its type casted as]_ lexical categorical, _[the column]_ should
> 
> 1. Be considered as ordered by narwhals.
> 2. The sort order of the `get_categories()` _[of this column]_ should be reflecting the lexical order.

So basically, for proper dataframe inference of ordered categoricals then:
- A lexical categorical should return its categories sorted in lexical order.
- A physical categorical should return its categories sorted in physical order.

This also means that, for both lexical and physical ordered categoricals, data will currently be placed in wrong subplots if there are panels without data, since a `sort` will be defined for the `row` / `column` encodings, as described in point 1 of this comment.

dangotbanned commented 2 months ago

Pinging @c-peters for additional context, as they may have the best understanding of pl.Categorical:

joelostblom commented 2 months ago

Forgive me if there is something I am misunderstanding, but it seems like all the issues reported here could stem from VegaLite not handling the sort keyword correctly when faceting into rows and columns as per https://github.com/vega/vega-lite/issues/5937. I think it is difficult to properly troubleshoot what is happening with the categorical field sorting in row and column facets until this is fixed in VegaLite.

Outside row and column faceting, all the scenarios with pd and pl categories work as expected as far as I can see (i.e. the order of the color scale match the order of the categories in each of these examples):

pd ordered

Identified as ordinal as expected:

image

Changing the categorical order changes the color scale order:

image

pd unordered

Identified as nominal as expected:

image

pl physical

Identified as ordinal as expected:

image

pl lexical

As already pointed out above, it seems that identification of lexical categories as ordinal data is not yet supported since it is not indicated as categorical data by narwhals and thus we get back unsorted nominal data:

image

Which would be the same as if we used the pl physical data frame and explicitly encoded the data type as nominal:

image

c-peters commented 1 month ago

I'm not too familiar with what happens in Altair / Narwhals, but indeed the call to get_categories does not return the categories in sorted order.

Would it be possible to call .sort() for the lexical ordered case: pl.col("category_column").cat.get_categories().sort()

dangotbanned commented 1 month ago

Related