has2k1 / plotnine

A Grammar of Graphics for Python
https://plotnine.org
MIT License

geom_bar and order of categorical variables. #810

Closed: fkgruber closed this issue 2 weeks ago

fkgruber commented 2 weeks ago

It appears that geom_bar does not respect the order of categorical variables. For example, I wanted to replicate the following R plot:

library(tidyverse)

df = bind_rows(
  tibble(type="A",
         value=factor(c("<0", "0", "(0,1]","(1,3]"),
                      levels=c("<0", "0", "(0,1]","(1,3]"),
                      ordered=F),
         count=c(10, 1, 5, 8)
         ),
  tibble(type="B",
         value=factor(c("<0", "0", "(0,2]","(2,4]"),
                      ordered=F,
                      levels=c("<0", "0", "(0,2]","(2,4]")),
         count=c(5,2,10,3)), 
  tibble(type="C",
         value=factor(c("<0", "0", "(0,2.3]","(2.3,4.2]"),
                      ordered=F,
                      levels=c("<0", "0", "(0,2.3]","(2.3,4.2]")),
         count=c(3,7,8,5)
         )
)

df %>%
  ggplot(aes(x=value, y=count)) + geom_bar(stat="identity") +
  facet_wrap(~type, scales="free")

image

In plotnine the order of the categorical variables is not respected:

import pandas as pd
import numpy as np
from plotnine import ggplot, aes, geom_bar, facet_wrap

# Create dataframes similar to the R tibble() function
df_a = pd.DataFrame({
    'type': 'A',
    'value': pd.Categorical(["<0", "0", "(0,1]", "(1,3]"],
                            categories=["<0", "0", "(0,1]", "(1,3]"],
                            ordered=False),
    'count': [10,1,5,8]
})

df_b = pd.DataFrame({
    'type': 'B',
    'value': pd.Categorical(["<0", "0", "(0,2]", "(2,4]"],
                            categories=["<0", "0", "(0,2]", "(2,4]"],
                            ordered=False),
    'count': [5,2,10,3]
})

df_c = pd.DataFrame({
    'type': 'C',
    'value': pd.Categorical(["<0", "0", "(0,2.3]", "(2.3,4.2]"],
                            categories=["<0", "0", "(0,2.3]", "(2.3,4.2]"],
                            ordered=False),
    'count': [3,7,8,5]
})

# Combine the dataframes
df = pd.concat([df_a, df_b, df_c], ignore_index=True)

# Create the plot using plotnine
(
    ggplot(df, aes(x='value', y='count')) +
    geom_bar(stat='identity') +
    facet_wrap('~type', scales='free')
)

image

has2k1 commented 2 weeks ago

plotnine does respect the order of categorical variables.

In your translation, this

# Combine the dataframes
df = pd.concat([df_a, df_b, df_c], ignore_index=True)

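part assumes that df["value"] is a categorical, but it is not. pd.concat drops the categorical dtype when the inputs do not all share the same categories, so the category order you declared is lost.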
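One way to confirm this is to inspect the column's dtype after concatenation. The sketch below uses two small illustrative frames (`a` and `b` are made-up names, not the frames from the issue) whose categoricals have different category sets:

```python
import pandas as pd

# Two frames whose "value" columns are categoricals with different categories.
a = pd.DataFrame(
    {"value": pd.Categorical(["<0", "(0,1]"], categories=["<0", "(0,1]"])}
)
b = pd.DataFrame(
    {"value": pd.Categorical(["<0", "(0,2]"], categories=["<0", "(0,2]"])}
)

combined = pd.concat([a, b], ignore_index=True)

# Each input column is categorical, but the concatenated column is not.
print(a["value"].dtype)         # category
print(combined["value"].dtype)  # a non-categorical dtype
print(isinstance(combined["value"].dtype, pd.CategoricalDtype))
```

Once the column is no longer categorical, the plotting code has no declared order to follow.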

Try

# Combine the dataframes
df = pd.concat([df_a, df_b, df_c], ignore_index=True)
df["value"] = df["value"].astype(pd.CategoricalDtype(
    ["<0", "0", "(0,1]", "(1,3]", "(0,2]", "(2,4]", "(0,2.3]", "(2.3,4.2]"]
))
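
If typing out the combined category list by hand is inconvenient, a possible alternative (not suggested in the thread) is `pandas.api.types.union_categoricals`, which merges the category sets in order of first appearance. This is a minimal sketch; `make_frame` is a hypothetical helper introduced only to keep the example short:

```python
import pandas as pd
from pandas.api.types import union_categoricals


def make_frame(kind, levels, counts):
    # Hypothetical helper: one frame per facet, with levels kept in the given order.
    return pd.DataFrame({
        "type": kind,
        "value": pd.Categorical(levels, categories=levels),
        "count": counts,
    })


frames = [
    make_frame("A", ["<0", "0", "(0,1]", "(1,3]"], [10, 1, 5, 8]),
    make_frame("B", ["<0", "0", "(0,2]", "(2,4]"], [5, 2, 10, 3]),
    make_frame("C", ["<0", "0", "(0,2.3]", "(2.3,4.2]"], [3, 7, 8, 5]),
]

df = pd.concat(frames, ignore_index=True)

# Rebuild "value" as a single categorical whose categories are the union
# of the per-frame categories, in order of first appearance.
df["value"] = union_categoricals([f["value"] for f in frames])

print(list(df["value"].cat.categories))
```

The resulting categories match the hand-written list above, so the plot call itself should not need to change.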

fkgruber commented 2 weeks ago

yes that works thanks!