vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Allow to reference data from index (of pandas dataframes) #3331

Open fhg-isi opened 8 months ago

fhg-isi commented 8 months ago

"By design Altair only accesses dataframe columns, not dataframe indices": https://altair-viz.github.io/user_guide/data.html#including-index-data

Please consider supporting indexed pandas dataframes in a future Altair version. Also see

https://stackoverflow.com/questions/77993730/how-to-use-indexed-data-frames-with-altair/

binste commented 8 months ago

Thanks for the suggestion. I see that it's easier to not have to write .reset_index(). The challenge is that Altair would need to call .reset_index() internally for every Pandas dataframe to make the index accessible. In many cases where the index is not needed for the chart, it would lead to unnecessary data being added to the Vega-Lite specification, see Altair Internals.

For this to work, Altair would need to know when the index is used and when it isn't so it can call .reset_index in only those cases. Altair can only do this with the help of VegaFusion, see https://github.com/altair-viz/altair/issues/2428 for details on why.

In short, I think it adds a lot of complexity to make this work and it would only work with VegaFusion which is an additional dependency. I'll leave this open in case I'm missing something.

fhg-isi commented 8 months ago

Also see

https://stackoverflow.com/questions/20084487/use-index-in-pandas-to-plot-data

import pandas as pd

df = pd.DataFrame(
    [
        {'id_foo': 1, 'energy_carrier': 'oil', '2000': 5, '2020': 10},
        {'id_foo': 2, 'energy_carrier': 'electricity', '2000': 10, '2020': 20},
    ]
)

print(type(df.index))   # <class 'pandas.core.indexes.range.RangeIndex'>

indexed_df = df.pivot_table(
    columns='energy_carrier',
    values=['2000', '2020'],
    aggfunc='sum',
)

print(type(indexed_df.index))  # <class 'pandas.core.indexes.base.Index'>

df.set_index('id_foo', inplace=True)

# pandas < 2.0 prints Int64Index; pandas >= 2.0 prints <class 'pandas.core.indexes.base.Index'>
print(type(df.index))  # <class 'pandas.core.indexes.numeric.Int64Index'>

binste commented 8 months ago

It's always good to get input on how other people use the library. Where do you see the downside of simply doing alt.Chart(df.reset_index()).encode(x="index")? It's a few more characters (.reset_index()) to type, so if it were easy to get rid of, I'd agree that it would be good to do so, but I don't think it is.

  • Index could be accessed without resetting the dataframe (e.g. df.index.values)?

We need the index as a proper column in the dataframe as Altair then needs to convert the Pandas dataframe to JSON (via a dictionary representation with df.to_dict()).
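A small pandas-only sketch of why the index has to become a column first. It assumes a record-oriented conversion (orient="records"), which is an assumption made here for illustration; the comment above only mentions df.to_dict(). The data is also made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"2000": [5, 10], "2020": [10, 20]},
    index=pd.Index(["oil", "electricity"], name="energy_carrier"),
)

# A record-oriented dict contains only the columns; the index is dropped.
records_without_index = df.to_dict(orient="records")

# After reset_index(), the former index is a column and survives the conversion.
records_with_index = df.reset_index().to_dict(orient="records")
```

Here records_without_index carries no trace of "oil" or "electricity", while each entry of records_with_index has an "energy_carrier" key.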

  • Type of index could be checked? If it's not a RangeIndex, consider it an explicit index?

Yep we could use this: Call .reset_index internally when it's not a RangeIndex. I think this is the best approach so far. We'd need to think through if this has any unintended side-effects. Does Pandas copy the whole dataframe when doing .reset_index? That might use too much memory in some cases and also slow down chart creation.
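A rough sketch of what such a check could look like. The helper name with_explicit_index is hypothetical and not part of Altair; it only illustrates the idea of resetting the index when it is not a RangeIndex, and it does not address the memory and side-effect questions raised above.

```python
import pandas as pd


def with_explicit_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: expose the index as a column unless it is a RangeIndex."""
    if isinstance(df.index, pd.RangeIndex):
        # Default-style index: treat it as carrying no information, leave df untouched.
        return df
    # Any other index type is treated as explicit and moved into the columns.
    return df.reset_index()


plain = pd.DataFrame({"id_foo": [1, 2], "2000": [5, 10]})
indexed = plain.set_index("id_foo")

plain_result = with_explicit_index(plain)      # returned unchanged
indexed_result = with_explicit_index(indexed)  # "id_foo" is a column again
```

One thing to think through with this approach: a RangeIndex that a user set deliberately (for example with a non-zero start) would still be skipped.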

  • Reset could be called only in the cases that require it, where an expression like "$index" is used?

It's very tricky to parse all expressions as they can appear in many places. Right now, this requires VegaFusion as mentioned above.

  • Provide an extra method or option for indexed dataframes (use_index=True)?

We could do that, but it feels easier to just let the user call .reset_index(); it is about the same number of characters to type.

  • Allow the index as a type for mapping, e.g. x = df.index?

Same reason as with the first suggestion, we need it as a column in the dataset. This is a requirement for generating the Vega-Lite specification (JSON).