vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.23k stars, 784 forks

Include autocompletion for column names #3213

Open joelostblom opened 11 months ago

joelostblom commented 11 months ago

Follow up from discussion with @binste in https://github.com/altair-viz/altair/pull/3208#issuecomment-1742136253.

It would be helpful to enable autocompletion of column names when typing .encode(x=<TAB> and/or alt.X(<TAB>. pandas already autocompletes column names when you type df['<TAB> (in all environments, or just IPython-based ones?), and IPython does something similar for dictionary keys, see https://github.com/ipython/ipython/pull/13745. Maybe there is something there we can reuse.

binste commented 4 months ago

I was really hoping this would be possible, but I'm starting to doubt it. Ibis handles the table['<TAB> case by implementing an _ipython_key_completions_ method which returns the available keys. See https://github.com/ibis-project/ibis/blob/main/ibis/expr/types/relations.py#L860 for the code and https://ipython.readthedocs.io/en/stable/config/integrating.html#tab-completion for the documentation. Pandas probably does the same.
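To make the hook concrete, here is a minimal sketch of a class that opts into IPython's key completion. The `Table` class is a made-up stand-in for illustration, not Ibis or pandas code; only the `_ipython_key_completions_` method name comes from the IPython documentation linked above.

```python
class Table:
    """Toy table that exposes its column names to IPython's key completer."""

    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def __getitem__(self, key: str) -> list:
        return self._columns[key]

    def _ipython_key_completions_(self) -> list[str]:
        # IPython calls this hook when the user types `table["<TAB>`.
        return list(self._columns)


table = Table({"column_1": [1, 2, 3], "column_2": [5, 1, 4]})
```

In an IPython session, typing `table["` followed by TAB should then offer `column_1` and `column_2`.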

However, we don't use square brackets, so this won't be triggered. As a source of autocompletion for the possible values of class/function arguments, I'm only aware of Literal type hints, which is what triggered this discussion. But we can't set them dynamically at runtime.
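A small sketch of why Literal hints don't fit here: the allowed values have to be spelled out in the source before anything runs. The `ColumnName` alias and `encode_x` function are hypothetical names used only for illustration.

```python
from typing import Literal, get_args

# The allowed values are fixed when the code is written, so they cannot
# follow whatever DataFrame is later passed to a chart.
ColumnName = Literal["column_1", "column_2"]


def encode_x(column: ColumnName) -> str:
    """Editors can suggest "column_1" / "column_2" for this argument."""
    return column


allowed = get_args(ColumnName)
```

Static completers read the hint from the source, which is why it cannot reflect a dataset that only exists at runtime.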

With dot access, we could override alt.X.__dir__ to provide suggestions -> alt.X.<TAB>. But in alt.X.__dir__ we don't know anything about the chart object, and hence nothing about the dataset it uses.
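For reference, a rough sketch of the `__dir__` approach on a standalone object. `ColumnNamespace` is a hypothetical helper, not an Altair class, and it assumes a completer that consults `dir()` at runtime (as IPython's typically does). Note that the column names must be handed in explicitly, which is exactly the information alt.X doesn't have.

```python
class ColumnNamespace:
    """Hypothetical helper that offers column names via dot completion."""

    def __init__(self, columns: list[str]) -> None:
        self._columns = list(columns)

    def __dir__(self) -> list[str]:
        # Runtime completers build suggestions for `obj.<TAB>` from dir(obj).
        return [*super().__dir__(), *self._columns]

    def __getattr__(self, name: str) -> str:
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)  # avoids recursion on private names
        if name in self._columns:
            return name  # hand back the column name as a plain string
        raise AttributeError(name)


ns = ColumnNamespace(["column_1", "column_2"])
```

Here `ns.column_1` evaluates to the string `"column_1"`, so it could be passed wherever a column name is expected.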

Is anyone aware of a library which achieved something similar?

dangotbanned commented 4 months ago

Is anyone aware of a library which achieved something similar? @binste

Just a quick sketch, but this is an option if you want key completions.

Requires python>=3.12.0 for syntax, just using what I had running. Can easily be adapted for 3.8.

from collections.abc import Mapping
from typing import Any, Protocol, runtime_checkable

import altair as alt
import polars as pl

@runtime_checkable
class _SupportsIPython(Protocol):
    # pandas/polars
    def _ipython_key_completions_(self) -> list[str]: ...

@runtime_checkable
class _SupportsColumnNames(Protocol):
    # pa.Table
    @property
    def column_names(self) -> list[str]: ...

type Proxyable = _SupportsIPython | _SupportsColumnNames | Mapping[str, Any]

class KeyCompletionsProxy:
    def __init__(self, _data: Proxyable, /) -> None:
        self._data = _data

    def _ipython_key_completions_(self) -> list[str]:
        if isinstance(self._data, _SupportsIPython):
            return self._data._ipython_key_completions_()
        elif isinstance(self._data, Mapping):
            return list(self._data.keys())
        elif isinstance(self._data, _SupportsColumnNames):
            return self._data.column_names
        else:
            raise TypeError(self._data)

    def __getitem__[T](self, item: T) -> T:
        return item

data = {"column_1": [1, 2, 3], "column_2": [5, 1, 4], "another_column": [1, 2, 3]}
df = pl.DataFrame(data)

kcp = KeyCompletionsProxy(df)
kcp_2 = KeyCompletionsProxy(data)
kcp_3 = KeyCompletionsProxy(df.to_arrow())

alt.Chart(df).encode(x=kcp["column_1"], y=kcp["column_2"]).mark_point(color="red")

[Screenshot: Visual Studio Code]

You could have a factory for KeyCompletionsProxy at the top level, so a user could type:

import altair as alt

cols = alt.cols(df)
alt.Chart(df).encode(x=cols["column_1"], ...)

joelostblom commented 4 months ago

Thanks for the suggestion @dangotbanned. Ideally we would avoid introducing an additional object that we need to complete from and have just x='col_name'. As @binste mentioned, it would be great if the possible values for the parameter x were dynamically updated based on the data specified to alt.Chart, but it seems like this is not possible.

@binste Although I also think this functionality would be neat, maybe it is less important now that many people use Copilot and similar tools for autocompletion, which in my experience are able to inspect the data objects and often suggest appropriate completions for the encoding parameters?

jonmmease commented 4 months ago

Another direction might be to support passing the dataframe columns themselves.

alt.Chart(df).mark_point().encode(x=df.column_1, y=df.column_2)

This way you get regular dataframe tab completion when selecting columns. Plotly Express does this. See https://plotly.com/python/px-arguments/#input-data-as-pandas-dataframes.

In addition to supporting columns from the source DataFrame itself, Plotly Express also supports passing DataFrame columns that have the same length without specifying a source DataFrame, which for us would be something like:

alt.Chart().mark_point().encode(x=df.column_1, y=df.column_2)

I don't think we'd necessarily need to go this far, but something to consider.

binste commented 4 months ago

Interesting sketch, thanks @dangotbanned! I'm also leaning towards not introducing an object just for the sake of autocompletion but it's interesting to know this could work.

I like the idea of accepting dataframe columns! We could simply parse out the names in encode. It would only be a minimal code change, I think. I'm not sure how we would handle it if users then pass in arrays/columns which are not from the source dataset that is passed to alt.Chart. That could be somewhat confusing. Thoughts on this? Maybe it's even worth going directly to Jon's second code example, although that looks complex to achieve...

jonmmease commented 4 months ago

First thing I would do is look at the details of how @nicolaskruchten accomplished this in plotly express. I think there's more going on than just grabbing the series name and assuming it belongs to the DataFrame. I think it would be important to at least validate that it's actually a column from the DataFrame (but not sure how hard that will be to get access to from inside the .encode method).

Then if it's not a column from the source DataFrame we could either error, or add it to the DataFrame if it has a compatible length and index.
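A rough sketch of the validation step described above, erroring when the column is not from the source. `resolve_field` is a hypothetical helper, not part of Altair or Plotly Express. It assumes pandas-like behaviour: columns carry a `.name`, `source[name]` raises `KeyError` for unknown columns, and columns offer an `.equals` method.

```python
from typing import Any


def resolve_field(source: Any, value: Any) -> str:
    """Return the column name that `value` refers to in `source`.

    `value` may be a plain string or a column-like object with a `.name`
    (for example a pandas Series). Hypothetical helper for illustration.
    """
    if isinstance(value, str):
        return value  # already a column name; passed through unchanged

    name = getattr(value, "name", None)
    if not isinstance(name, str):
        raise TypeError(f"cannot infer a column name from {type(value).__name__}")

    try:
        column = source[name]
    except KeyError:
        raise ValueError(f"{name!r} is not a column of the source data") from None

    # A matching name is not enough: the values must match the source too.
    if not column.equals(value):
        raise ValueError(f"{name!r} differs from the source column of that name")
    return name
```

With a pandas DataFrame, `resolve_field(df, df.column_1)` would be expected to return `"column_1"`. Comparing with `.equals` scans the data, so a real implementation might prefer a cheaper check, and other DataFrame libraries may raise a different exception for unknown columns.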

Of course this is very pandas centric, so we'd also want to consider if there is a way to accomplish this with DataFrame Interchange Protocol DataFrames as well.

nicolaskruchten commented 3 months ago

I don't actually think I got anything fancy going in PX... we just used strings and the df.<tab> trick basically. We also accepted just doing method(x=df.whatever, y=df.whatever) without having to separately supply df (and we construct the DF we need under the hood) so folks can just do px.scatter(x=[1,2,3], y=[4,5,6]) or whatever
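As an illustration of the "construct the DF under the hood" idea, here is a minimal sketch that combines equal-length sequences into row records. `build_table` is a hypothetical name; this is not how Plotly Express actually implements it, just the general shape of the approach.

```python
from collections.abc import Sequence
from typing import Any


def build_table(**channels: Sequence[Any]) -> list[dict[str, Any]]:
    """Combine equal-length sequences into row records, one key per channel."""
    lengths = {name: len(values) for name, values in channels.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"all arguments must have the same length, got {lengths}")
    names = list(channels)
    # zip(*...) walks the sequences in parallel, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*channels.values())]


rows = build_table(x=[1, 2, 3], y=[4, 5, 6])
```

Each channel name doubles as the field name, so the resulting records could be used as the chart's data with `x` and `y` as the encoded fields.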