pola-rs / polars


Add syntax sugar over `pl.col()` using symbol parsing #7398

Closed hyiltiz closed 1 year ago

hyiltiz commented 1 year ago

Polars version checks

Issue description

Writing pl.col('some_column') is too long to type, especially since working with columns is the main focus of most operations supported in Polars. Would it be possible to treat some special symbol as a column name within functions that accept column name specifications?

Reproducible example

import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "color": ["blue", "red", "green"],
        "size": ["small", "medium", "large"],
    }
)

# this is currently supported
multi_filter_df = df.filter((pl.col("id") <= 2) & (pl.col("size") == "small"))

# some syntax sugar over pl.col()
multi_filter_df = df.filter((:id <= 2) & (:size == "small"))

Expected behavior

Using :column (or some other notation, since : is already significant in Python syntax) to provide syntax sugar over pl.col() would make specifying columns much easier.

Alternatives:

| syntax sugar | pros | cons |
| --- | --- | --- |
| `:column` | symbol syntax used by Ruby and easy to type | may not be possible as `:` is significant syntax in Python |
| `$column` | familiarity due to variable name reference syntax in Bash | may not be possible due to Python's not-so-flexible parser |
| `@column` | may be possible to implement? | may confuse users with magic syntax |

Installed versions

```
---Version info---
Polars: 0.16.4
Index type: UInt32
Platform: macOS-13.2-arm64-arm-64bit
Python: 3.10.10 (main, Feb 9 2023, 02:35:44) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow:
pandas: 1.5.3
numpy: 1.24.2
fsspec:
connectorx:
xlsx2csv:
deltalake:
matplotlib: 3.7.0
```
ghuls commented 1 year ago

I don't think variables are allowed to start with those characters.

To make them slightly shorter, you can do the following:

c = pl.col

multi_filter_df = df.filter((c("id") <= 2) & (c("size") == "small"))
jcmkk3 commented 1 year ago

As ghuls said, this isn't really possible. Python does not have a separate symbol literal like some languages have. For brevity's sake, the options are a single-character identifier with parentheses (c("column_name")), brackets (c["column_name"]), or attribute lookup (c.column_name). The final option would be to do some sort of string parsing like pandas' query does, but at that point you can't really make use of any Python language features since you've just created your own language.

zundertj commented 1 year ago

Another solution is to store (intermediate) expressions in Python variables:

id = pl.col("id")
size = pl.col("size")
my_filter = (id <= 2) & (size == "small")
multi_filter_df = df.filter(my_filter)

This extends very well once you start building more complex expressions, for example when you want to re-use the filter expression.
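
For example, a hypothetical continuation of the idea above (the `is_small` name and the two-column frame are just illustrative), reusing the same expression in two different contexts:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "size": ["small", "medium", "large"]})

# the same expression object can be reused in different contexts
is_small = (pl.col("id") <= 2) & (pl.col("size") == "small")

filtered = df.filter(is_small)                          # row subset
flagged = df.with_columns(is_small.alias("is_small"))   # boolean flag column
```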

On the renaming suggestion, the cleanest way is to do this on import I would say:

from polars import col as c

But it may be annoying for other users who are not used to this convention. There was a time when the Polars docs also often used col() rather than pl.col(), but the latter is more explicit and really helps future readers of your code, as it brings consistency to the table. That is worth a couple of characters.

gab23r commented 1 year ago

Just for info, with ibis you can do `from ibis import _` and then `df.filter(_.column_name == ...)`.

jcmkk3 commented 1 year ago

Just for info, with ibis you can do `from ibis import _` and then `df.filter(_.column_name == ...)`.

This would be like the attribute solution that I proposed above. I really hate that ibis made "_" an export. If folks want to use underscore as the alias then that would be their choice, but it is a poor choice for the library to promote. Underscore already has an idiomatic use in python for discarding a value. It would be really easy to accidentally stomp on your own code without thinking about it.

In polars, you could create a little helper like this one:

import polars as pl

class ColumnHelper:
    def __getattr__(self, name):
        return pl.col(name)

    def __getitem__(self, name):
        return pl.col(name)

c = ColumnHelper()

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"
penguins = pl.read_csv(url, null_values="NA")

# Access columns using attributes on the `ColumnHelper` instance
penguins.groupby(c.island, c.species).agg(c.body_mass_g.mean())

# Alternative when there are spaces or other invalid characters in name
penguins.groupby("island", "species").agg(c["body_mass_g"].mean())

# Also works with multiple columns
penguins.groupby(c.island, c.species).agg(c["bill_length_mm", "bill_depth_mm"].mean())

Some possible enhancements:

hyiltiz commented 1 year ago

While not ideal, it should still be possible to capture the symbol literal and bypass the limits of Python's parser to introduce new syntax-sugar [1].

This is definitely not quite Pythonic and as such, probably not ideal for a major library like polars to promote. Looks like ColumnHelper is the cleanest approach we have for now...

[1] https://github.com/alexmojaki/sorcery

alexander-beedie commented 1 year ago

Looks like ColumnHelper is the cleanest approach we have for now...

Yup 👍 (and, indeed, hacking on the Python parser to introduce peculiar syntax is not the way to go :)). Maybe we could introduce something like the ColumnHelper as an opt-in/importable utility if there's demand; it does offer quite compact/clean usage for the simple cases...

jcmkk3 commented 1 year ago

I don't have a good solution for it, but what I'd like to see is the ability to autocomplete column names when using pl.col or even something like the ColumnHelper. It would need to be something added at the tooling level and not the Polars library, but I think it would be a big usability improvement for interactive use. I imagine that it would ideally live somewhere like the jedi library and/or the various python LSP clients. If there could be some sort of standard, then other libraries like Ibis, Seaborn, Altair, etc. could also benefit.

aldanor commented 1 year ago

what I'd like to see is the ability to autocomplete column names when using pl.col

How would it know where to pick column names from? (given that pl.col is detached from actual dataframes, which is kind of the whole point of it)

indigoviolet commented 1 year ago

I haven't prototyped it, but you could imagine ColumnHelper knowing about active dataframes' column names, and dynamically defining properties for them.

I would love to have something like ColumnHelper -- this was suggested in the Discord a while ago as well.

aldanor commented 1 year ago

@indigoviolet well if you REALLY want, you can always do it yourself, just take globals() and filter all dataframe objects and their columns :)

You can even add _ipython_key_completions_() method to get square bracket completions.
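
For illustration, here is a minimal sketch combining the two ideas above (scanning a namespace for frames, plus completion hooks). The class name and the explicit `namespace` argument are hypothetical, not a Polars API; `__dir__` and `_ipython_key_completions_` are the standard completion hooks:

```python
import polars as pl


class CompletingColumnHelper:
    """Sketch: pl.col shorthand whose completions come from frames in a namespace."""

    def __init__(self, namespace):
        # e.g. pass in globals() so completion can "see" the session's frames
        self._namespace = namespace

    def _visible_columns(self):
        names = set()
        for obj in self._namespace.values():
            if isinstance(obj, (pl.DataFrame, pl.LazyFrame)):
                names.update(obj.columns)
        return sorted(names)

    def __getattr__(self, name):
        if name.startswith("_"):
            # don't hijack dunder / IPython protocol lookups
            raise AttributeError(name)
        return pl.col(name)

    def __getitem__(self, name):
        return pl.col(name)

    def __dir__(self):
        # drives attribute completion: c.<TAB>
        return self._visible_columns()

    def _ipython_key_completions_(self):
        # drives square-bracket completion in IPython/Jupyter: c["<TAB>
        return self._visible_columns()


# in a notebook: c = CompletingColumnHelper(globals())
```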

alexander-beedie commented 1 year ago

You can even add _ipython_key_completions_() method to get square bracket completions.

I took care of that back in November: https://github.com/pola-rs/polars/pull/5477 ;) But, for exactly the reasons you gave, it only covers DataFrame and struct autocomplete.

ritchie46 commented 1 year ago

You can even add _ipython_key_completions_() method to get square bracket completions.

I took care of that back in November: #5477 ;) But, for exactly the reasons you gave, it only covers DataFrame and struct autocomplete.

We might look upwards in locals for LazyFrame/DataFrame types. 🤔

alexander-beedie commented 1 year ago

We might look upwards in locals for LazyFrame/DataFrame types. 🤔

I'd give that approach a qualified "sort of". Locals could contain a lot of dataframes (on the low end maybe just 1-2, but it's not unreasonable to think there may be tens, hundreds, or even thousands in some use-cases), and you wouldn't be able to tie your col back to a specific one, meaning - as @aldanor says - you'd have to walk everything in the locals dict to find dataframes and potentially offer all possible cols. Or perhaps restrict it to "the most recently instantiated frame", but then frequently be wrong if the expression was being applied to something declared a few lines earlier...

(The frame/locals inspection trick works much better when identifying a specific named object; I pointed @universalmind303 at duckdb's code for doing that with their SQL queries that 'magically' pick up the named dataframe, for example - that works nicely as the locals will only contain one such named object.)

If we were to offer an "official" helper object for this, we could perhaps make it capable of that (via an opt-in flag), but I probably wouldn't turn it on by default.

jcmkk3 commented 1 year ago

How would it know where to pick column names from? (given that pl.col is detached from actual dataframes, which is kind of the whole point of it)

That's why I said that it would need to be implemented at the tooling level. It would need some combination of AST parsing (to see that pl.col is being used on a dataframe object) and object introspection (of the dataframe object to know what fields are available). Most tooling right now tends to focus on static analysis or dynamic introspection, unfortunately. There's no reason that a tool couldn't do both, however.
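
As a toy illustration of the static-analysis half (a sketch, not an existing tool), locating pl.col(...) references in a snippet of source text with Python's ast module could look like this:

```python
import ast

source = 'penguins.groupby(pl.col("island")).agg(pl.col("body_mass_g").mean())'


class ColFinder(ast.NodeVisitor):
    """Collect string literals passed to pl.col(...) calls."""

    def __init__(self):
        self.columns = []

    def visit_Call(self, node):
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and func.attr == "col"
            and isinstance(func.value, ast.Name)
            and func.value.id == "pl"
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    self.columns.append(arg.value)
        self.generic_visit(node)


finder = ColFinder()
finder.visit(ast.parse(source))
print(finder.columns)  # ['island', 'body_mass_g']
```

A completion tool would still need the dynamic half: introspecting the dataframe object those expressions are applied to.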

I think that declarative APIs like the one Polars uses are the ideal way to perform data analysis tasks, but the downside is that we can't get autocompletion using standard techniques the way Pandas can.

gab23r commented 1 year ago

I think the ColumnHelper would be super valuable even if it doesn't come with autocomplete yet. There is no autocomplete on pl.col either, by the way. I agree that the use of _ is not a good idea, as it already means something else that is used a lot. datatable uses the f shortcut for "frame proxy", but for our use case I like c as a shortcut for "column proxy".

char101 commented 1 year ago

For tooling access, we can add df.col and set the __dir__ magic method on that col object. The editor can then inspect the col object and get the list of names. Of course, with that you don't write pl.col but df.col:

df = pl.DataFrame({"a": [1, 2, 3]})
df.select(df.col.a)
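
For illustration, a minimal sketch of such a frame-bound proxy (the FrameColumns class and the manual wiring are hypothetical; an actual df.col attribute would have to come from Polars itself):

```python
import polars as pl


class FrameColumns:
    """Sketch of a df.col-style proxy bound to one frame."""

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return pl.col(name)

    def __dir__(self):
        # editors/Jupyter call dir() for completion, so only real columns show up
        return list(self._df.columns)


df = pl.DataFrame({"a": [1, 2, 3]})
col = FrameColumns(df)   # stand-in for the proposed df.col
df.select(col.a)         # equivalent to df.select(pl.col("a"))
```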

Personally, I just want to be able to type pl.col.name since it is just so convenient, especially with how expressions are written in polars, where we need to call col multiple times.

Also, rather than making a helper, we can just make col an object:

import polars as pl

class Column:
    orig_col = pl.col

    def __call__(self, *names):
        return Column.orig_col(*names)

    def __getattr__(self, name):
        return self.__call__(name)

    def __getitem__(self, name):
        return self.__call__(name)

pl.col = Column()

By the way, __getitem__ in ColumnHelper above is a bit incorrect (although it still works the same), because __getitem__ only accepts one argument. a['b', 'c'] is actually a[('b', 'c')].
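
A tiny standalone demonstration of that indexing behavior (the Probe class is hypothetical, only there to show how the keys arrive):

```python
class Probe:
    def __getitem__(self, key):
        return key


p = Probe()
assert p["b", "c"] == ("b", "c")   # the two keys arrive as a single tuple
```

This is why the ColumnHelper above still works: the tuple is handed to pl.col as a single sequence of names.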

dwt commented 1 year ago

I would like to add my vote for df.col.name, as that would enable autocompletion in environments like Jupyter, which can greatly ease exploring, debugging and working with DataFrames.

sslivkoff commented 1 year ago

pl.col.name would be better than df.col.name in contexts where you don't have a reference to the dataframe or where the dataframe variable name is long

At first glance this solution feels a bit weird, but I do think that typing out pl.col('name') can be burdensome, and I can't think of anything more succinct and explicit than pl.col.name given the syntax constraints of Python. Alternatively, pl.c.name might be nice too, at the cost of some explicitness.

I'm not really a fan of solutions that require instantiating an intermediate column helper object like ColumnHelper. There are situations where they work really well but they also increase mental overhead and maintenance burden enough that sometimes they are not worth it

dwt commented 1 year ago

pl.col.name or pl.c.name (the SQLAlchemy way) would work great as syntactic sugar. For autocompletion it is pointless though, as there is no way the runtime (Jupyter) can query the object for the attribute.

The easiest way to solve the long variable name syndrome is to introduce a local alias, i.e.

df = data_frame_with_long_name
df.c.name
sslivkoff commented 1 year ago

maybe the best way would be to have both pl.c and df.c

the local alias trick won't work in situations where you don't have access to the dataframe, like in

df = pl.scan_parquet(...).(...).select(pl.col('name')).collect()

or

df_joined = df1.join(df2, ...).groupby(...).agg(pl.col('name'))

basically situations with non-trivial function chaining

char101 commented 1 year ago

Using a one-character attribute name is not good practice. It is probably OK for a loop variable or a small-scoped temporary variable, but if you are going to use it in a lot of places, better to make it readable.

For context-sensitive completion like that, I think that is the job of the editor/completion server. It can see that pl refers to polars and generate the completion based on the surrounding code. It is not the place of a data processing library to implement complex autocompletion like that.

dwt commented 1 year ago

@char101 I may not have made myself clear: autocompletion in live coding environments like Jupyter (to my understanding) works by actually asking the object what its attributes are and then showing them in the autocomplete. (That object would be df.col, maybe also available as df.c for brevity; in SQLAlchemy, where I first saw this pattern, it's Table.columns, with Table.c as the short form where you want it.) That works because dir(df.col) actually returns all the attributes that are available.

To my understanding this is different from editors relying on language servers and tools like tree sitter to infer possible attributes from the context of the source code.

This is also why live coding environments like Jupyter are so damn effective at exploring objects (and datasets): you can so easily see what attributes / functions they actually expose - no complicated language server / editor setup or knowledge of the code and frameworks needed. It all happens by talking to the actual object being completed on.

char101 commented 1 year ago

Jupyter is interactive, so it should be easy to assign the dataframe to a variable, and then implementing __dir__ on df.col is straightforward.

My comment is on the example by sslivkoff above where we create the dataframe and then select from it directly, e.g. pl.read_csv(...).select(pl.col(...)). Implementing autocompletion on pl.col is difficult because it is context-sensitive. To do that we need to generate a traceback, find the outer frame, find the dataframe object, and then inspect the columns of the dataframe, and that is for a simple case.

sslivkoff commented 1 year ago

good point...autocomplete would be difficult to implement in the context of function chaining and probably has a large number of edge cases (alias(), join() suffixes, rename(), ...)

To your other point, I can also get on board with the idea that one-character names are too short. In the spectrum of

  1. pl.col('name')
  2. pl.col.name
  3. pl.c.name

the difference in ease of typing is much larger between 1 & 2 than between 2 & 3.