pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Column names with a dot (`.`) do not plot correctly with the new Altair plotting backend #19735

Open heshamdar opened 3 weeks ago

heshamdar commented 3 weeks ago

Reproducible example

```python
import polars as pl

df = pl.DataFrame({'x.1': [1, 2, 3, 4, 5], 'x.2': [1, 2, 3, 4, 5]})
df.plot.scatter(x='x.1', y='x.2')
```

Log output

No response

Issue description

Columns whose names include a dot (`.`) are parsed incorrectly by Altair, so the resulting plots are empty. This behaviour is described here: https://github.com/vega/vega-lite/issues/1775

The suggested fix (as mentioned in the discussion) is to wrap the column name in square brackets.

In the provided example the plotting config that gets generated is:

```json
{
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {"name": "data-71d7206cf897da6faf64633a8390c7ee"},
  "mark": {"type": "point", "tooltip": true},
  "encoding": {
    "x": {"field": "x.1", "type": "quantitative"},
    "y": {"field": "x.2", "type": "quantitative"}
  },
  "params": [
    {
      "name": "param_10",
      "select": {"type": "interval", "encodings": ["x", "y"]},
      "bind": "scales"
    }
  ],
  "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json",
  "datasets": {
    "data-71d7206cf897da6faf64633a8390c7ee": [
      {"x.1": 1, "x.2": 1},
      {"x.1": 2, "x.2": 2},
      {"x.1": 3, "x.2": 3},
      {"x.1": 4, "x.2": 4},
      {"x.1": 5, "x.2": 5}
    ]
  }
}
```

If the config instead includes square brackets in the field encoding it plots correctly:

```json
{
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {"name": "data-71d7206cf897da6faf64633a8390c7ee"},
  "mark": {"type": "point", "tooltip": true},
  "encoding": {
    "x": {"field": "[x.1]", "type": "quantitative"},
    "y": {"field": "[x.2]", "type": "quantitative"}
  },
  "params": [
    {
      "name": "param_10",
      "select": {"type": "interval", "encodings": ["x", "y"]},
      "bind": "scales"
    }
  ],
  "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json",
  "datasets": {
    "data-71d7206cf897da6faf64633a8390c7ee": [
      {"x.1": 1, "x.2": 1},
      {"x.1": 2, "x.2": 2},
      {"x.1": 3, "x.2": 3},
      {"x.1": 4, "x.2": 4},
      {"x.1": 5, "x.2": 5}
    ]
  }
}
```
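
The two configs above differ only in their `field` values. As a rough sketch, that rewrite could be applied to a spec dictionary in plain Python before rendering. `bracket_dotted_fields` is a hypothetical helper (not part of Polars or Altair), and it assumes that fields live under the top-level `encoding` key and that a dot is the only problematic character.

```python
import copy


def bracket_dotted_fields(spec: dict) -> dict:
    """Return a copy of a Vega-Lite spec with dotted field names wrapped in [].

    Hypothetical helper for illustration; it only inspects the top-level
    "encoding" channels and leaves the input spec untouched.
    """
    fixed = copy.deepcopy(spec)
    for channel in fixed.get("encoding", {}).values():
        field = channel.get("field") if isinstance(channel, dict) else None
        # Skip fields without a dot and fields that are already bracketed.
        if isinstance(field, str) and "." in field and not field.startswith("["):
            channel["field"] = f"[{field}]"
    return fixed


# Trimmed-down version of the spec from the issue.
spec = {
    "mark": {"type": "point", "tooltip": True},
    "encoding": {
        "x": {"field": "x.1", "type": "quantitative"},
        "y": {"field": "x.2", "type": "quantitative"},
    },
}

fixed = bracket_dotted_fields(spec)
print(fixed["encoding"]["x"]["field"])  # [x.1]
```

The `datasets` section is deliberately left alone: the keys in the data rows keep their original names, and only the field references change.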

Expected behavior

Plots generated from columns with a `.` in their name should render correctly.

Installed versions

```
--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.2.1-arm64-arm-64bit
Python:              3.10.4 (main, May 28 2022, 12:23:44) [Clang 13.1.6 (clang-1316.0.21.2.5)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair               5.4.1
cloudpickle
connectorx
deltalake
fastexcel            0.10.4
fsspec
gevent
great_tables         0.10.0
matplotlib           3.9.1.post1
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl
pandas               2.2.2
pyarrow              16.1.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

Julian-J-S commented 3 weeks ago

This has nothing to do with polars 😉

But I agree. This is annoying and bad design imo. You don't always control variable names. I had the same "problem" a few months ago: https://github.com/vega/altair/issues/3599. Unfortunately this is intended behavior of Altair (see documentation).

orlp commented 3 weeks ago

I think we can escape the column names when altair is set as backend?
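
A minimal sketch of what such escaping might look like, using the backslash form (`x\.1`) mentioned later in this thread. `escape_column_name` and `SPECIAL_CHARS` are hypothetical names, not existing Polars or Altair API, and the exact set of characters that needs escaping is an assumption here.

```python
# Characters assumed to need escaping in Altair field references.
# This set is an assumption for illustration, not taken from Polars.
SPECIAL_CHARS = ".[]:"


def escape_column_name(name: str) -> str:
    """Prefix every special character in `name` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in name)


escaped = escape_column_name("x.1")
print(escaped)
```

Note that this sketch is not idempotent: running it on an already escaped name would escape the dot a second time, so a real implementation would need to decide whether user input is taken as raw or as pre-escaped.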

MarcoGorelli commented 3 weeks ago

thanks @heshamdar for the report!

> I think we can escape the column names when altair is set as backend?

sure, but if the user isn't aware that this is being done for them, and they just pass `x='x.1'` instead of `x='x\.1'`, then they would get

ValueError: Unable to determine data type for the field "x.1"; verify that the field name is not misspelled. If you are referencing a field from a transform, also confirm that the data type is specified correctly.

Maybe the solution is to just raise an informative error from the start if the user's input contains columns with problematic names (like '.', ':'), and just refuse to make a plot at all in such situations? Maybe this should even be done in Altair itself?
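
The "raise an informative error" idea could be sketched roughly as below. `check_plot_columns` and `PROBLEM_CHARS` are hypothetical names, and the character list is an assumption that extends the `'.'` and `':'` examples above with brackets; nothing here reflects actual Polars or Altair code.

```python
# Characters assumed to be problematic in Altair field names (an assumption).
PROBLEM_CHARS = (".", ":", "[", "]")


def check_plot_columns(columns: list[str]) -> None:
    """Raise a ValueError naming every column that contains a problem character."""
    bad = [c for c in columns if any(ch in c for ch in PROBLEM_CHARS)]
    if bad:
        raise ValueError(
            f"Cannot plot columns with special characters in their names: {bad}. "
            "Consider renaming them before plotting."
        )


# Column names from the reproducible example in this issue.
try:
    check_plot_columns(["x.1", "x.2"])
    message = ""
except ValueError as exc:
    message = str(exc)

print(message)
```

Failing early like this trades convenience for clarity: the user gets an explicit message instead of an empty plot, but has to rename the columns themselves.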