has2k1 / plotnine

A Grammar of Graphics for Python
https://plotnine.org
MIT License
3.92k stars, 210 forks

geom_col drops fill groups when using position_dodge #743

Closed AlFontal closed 6 months ago

AlFontal commented 6 months ago

I've been scratching my head quite a bit over this one, and I've finally managed to put together a minimal reproducible example.

The context is the following:

I am using plotnine 0.12.1 here. I have a discrete variable on the x-axis and a continuous variable on the y-axis. I have a second discrete variable that I use for the fill aesthetic, and a third discrete variable that I use as the group aesthetic. In some cases, a single value of x has multiple values of this third variable, yet I am interested in the total height of the individual columns, so I use position='dodge'.

The problem is that the generated plot seems to completely throw away several of the values of the fill aesthetic variable, producing a completely erroneous plot. Here is the example:

import numpy as np
import pandas as pd
import plotnine as p9

alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
# I tested several groups/samples combinations but it always seems to be wrong
n_groups = 5
n_samples = 5
df = pd.DataFrame(
    dict(
        taxa=np.tile(list(alphabet[:n_groups]), n_samples),
        sample_id=np.repeat(['S01', 'S02', 'S03', 'S04', 'S05'], n_groups),
        counts=np.random.randint(0, 100, n_groups * n_samples),
        date=np.repeat(['02/17', '02/24', '03/03', '03/03', '03/10'], n_groups),
    )
)
df
taxa  sample_id  counts  date
A     S01        41      02/17
B     S01        88      02/17
C     S01        43      02/17
D     S01        30      02/17
E     S01        54      02/17
A     S02        26      02/24
B     S02        19      02/24
C     S02        99      02/24
D     S02        74      02/24
E     S02        83      02/24
A     S03        0       03/03
B     S03        56      03/03
C     S03        34      03/03
D     S03        19      03/03
E     S03        67      03/03
A     S04        92      03/03
B     S04        64      03/03
C     S04        83      03/03
D     S04        86      03/03
E     S04        13      03/03
A     S05        73      03/10
B     S05        27      03/10
C     S05        87      03/10
D     S05        73      03/10
E     S05        14      03/10

If we were to now plot this data with the default position, the representation seems correct:

(p9.ggplot(df)
      + p9.aes('date', 'counts', group='sample_id', fill='taxa')
      + p9.geom_col()
 )

image

However, since we want to have the total height of each individual sample, we use position='dodge' and the values are completely off:

(p9.ggplot(df)
      + p9.aes('date', 'counts', group='sample_id', fill='taxa')
      + p9.geom_col(position='dodge')
 )

image

The issue doesn't arise simply from the differing widths of the columns: if we keep a one-to-one ratio between the x-axis variable and the group variable but still use the 'dodge' position, the representation is also wrong:

(p9.ggplot(df.query('sample_id!="S04"'))
      + p9.aes('date', 'counts', group='sample_id', fill='taxa')
      + p9.geom_col(position='dodge')
      )

image

Any ideas on what might be causing this? I will take a look later at the actual code in position_dodge.py but as of now I am quite clueless...

has2k1 commented 6 months ago
  1. You are doing a rather complicated dodge that overwhelms position_dodge. It does not throw away any values; rather, it just doesn't do enough dodging, so the taller columns may overlap the shorter ones depending on the order. You can add

    geom_text(aes(label="counts"), position=position_dodge(width=0.9))

    to see that everything is there.

  2. When you have more than one group at the same x location, the width of the bars is split between the groups. This adds to the impression that position_dodge is confused!
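To make the second point concrete, here is a small pure-Python sketch that counts how many sample_id groups share each date in the example data and how much width each group would get. It is a simplified model, not plotnine's actual implementation: it assumes the dodge width at each x position is divided equally among the groups there, and it assumes a total width of 0.9 (the value passed to position_dodge above). The names DODGE_WIDTH, groups_per_date and bar_widths are made up for illustration.

```python
from collections import defaultdict

# (sample_id, date) pairs taken from the data frame in the report.
samples = [
    ("S01", "02/17"),
    ("S02", "02/24"),
    ("S03", "03/03"),
    ("S04", "03/03"),
    ("S05", "03/10"),
]

# Assumption: a total dodge width of 0.9 per x position.
DODGE_WIDTH = 0.9

# Collect the groups that land on each x position.
groups_per_date = defaultdict(list)
for sample_id, date in samples:
    groups_per_date[date].append(sample_id)

# Simplified model: the width at each x position is split equally
# among the groups found there.
bar_widths = {
    date: DODGE_WIDTH / len(ids) for date, ids in groups_per_date.items()
}

for date, ids in groups_per_date.items():
    print(f"{date}: {len(ids)} group(s), width per group = {bar_widths[date]:.2f}")
```

Under this model only 03/03 holds two groups (S03 and S04), so its bars come out half as wide as those at the other dates.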

The solution is to use the more capable position_dodge2 and tell it (preserve="single") to preserve the width of a single element in a group.

geom_col(position=position_dodge2(preserve="single"))