Closed hyiltiz closed 1 year ago
I don't think variables are allowed to start with those characters.
To make them slightly shorter, you can do the following:
c = pl.col
multi_filter_df = df.filter((c("id") <= 2) & (c("size") == "small"))
As ghuls said, this isn't really possible. Python does not have a separate symbol literal like some languages have. For brevity's sake, the options are a single-character identifier with parentheses (`c("column_name")`), brackets (`c["column_name"]`), or attribute lookup (`c.column_name`). The final option would be to do some sort of string parsing like pandas `query` does, but at that point you can't really make use of any Python language features, since you've just created your own language.
Another solution is to store (intermediate) expressions in Python variables:
id = pl.col("id")
size = pl.col("size")
my_filter = (id <= 2) & (size == "small")
multi_filter_df = df.filter(my_filter)
This extends very well once you start building more complex expressions, for example when you want to re-use the filter expression.
On the renaming suggestion, I would say the cleanest way is to do this on import:
from polars import col as c
But it may be annoying for other users who are not used to this convention. There was a time when the Polars docs also often used `col()` rather than `pl.col()`, but the latter is more explicit and really helps future readers of your code, as it brings consistency to the table. That is worth a couple of characters.
Just for info, with ibis
you can do
from ibis import _
df.filter(_.column_name == ...)
> Just for info, with `ibis` you can do
>
>     from ibis import _
>     df.filter(_.column_name == ...)
This would be like the attribute solution that I proposed above. I really hate that ibis made `_` an export. If folks want to use underscore as the alias then that would be their choice, but it is a poor choice for the library to promote. Underscore already has an idiomatic use in Python for discarding a value. It would be really easy to accidentally stomp on your own code without thinking about it.
In polars, you could create a little helper like this one:
import polars as pl

class ColumnHelper:
    def __getattr__(self, name):
        return pl.col(name)

    def __getitem__(self, name):
        return pl.col(name)

c = ColumnHelper()
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"
penguins = pl.read_csv(url, null_values="NA")
# Access columns using attributes on the `ColumnHelper` instance
penguins.groupby(c.island, c.species).agg(c.body_mass_g.mean())
# Alternative when there are spaces or other invalid characters in name
penguins.groupby("island", "species").agg(c["body_mass_g"].mean())
# Also works with multiple columns
penguins.groupby(c.island, c.species).agg(c["bill_length_mm", "bill_depth_mm"].mean())
Some possible enhancements: you could reserve the attribute space for only column names.

While not ideal, it should still be possible to capture the symbol literal and bypass the limits of Python's parser to introduce new syntax sugar [1].
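One such enhancement can be sketched as follows: a hypothetical `StrictColumnHelper` that is told the schema up front and rejects anything else, so the attribute space really is reserved for column names (a plain `col` function stands in for `pl.col` to keep the sketch self-contained):

```python
# Stand-in for pl.col so this sketch runs without polars; real code
# would return pl.col(name) instead of a string.
def col(name):
    return f"col({name})"

class StrictColumnHelper:
    """Hypothetical variant of ColumnHelper that only resolves known columns."""

    def __init__(self, columns):
        self._columns = set(columns)

    def __getattr__(self, name):
        if name in self._columns:
            return col(name)
        # Reserve the attribute space for column names only.
        raise AttributeError(f"no such column: {name!r}")

c = StrictColumnHelper(["island", "species", "body_mass_g"])
print(c.island)  # col(island)
try:
    c.bodymass
except AttributeError as exc:
    print(exc)  # no such column: 'bodymass'
```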
This is definitely not quite Pythonic and, as such, probably not ideal for a major library like Polars to promote. Looks like `ColumnHelper` is the cleanest approach we have for now...
> Looks like ColumnHelper is the cleanest approach we have for now...
Yup 👍 (And, indeed, hacking on the python parser to introduce peculiar syntax is not the way to go :) Maybe we could introduce something like the ColumnHelper as an opt-in/importable utility if there's demand; it does offer quite compact/clean usage for the simple cases...
I don't have a good solution for it, but what I'd like to see is the ability to autocomplete column names when using `pl.col`, or even something like the `ColumnHelper`. It would need to be something added at the tooling level and not in the Polars library, but I think it would be a big usability improvement for interactive use. I imagine that it would ideally live somewhere like the `jedi` library and/or the various Python LSP clients. If there could be some sort of standard, then other libraries like Ibis, Seaborn, Altair, etc. could also benefit.
> what I'd like to see is the ability to autocomplete column names when using `pl.col`
How would it know where to pick column names from? (given that pl.col is detached from actual dataframes, which is kind of the whole point of it)
I haven't prototyped it, but you could imagine ColumnHelper knowing about active dataframes' column names, and dynamically defining properties for them.
I would love to have something like ColumnHelper -- suggested in the Discord a while ago as well.
@indigoviolet well, if you REALLY want, you can always do it yourself: just take `globals()` and filter all dataframe objects and their columns :)
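A rough version of that `globals()` trick might look like this; `FakeFrame` is a duck-typed stand-in for a polars DataFrame (only the `.columns` attribute matters for the sketch):

```python
# Duck-typed stand-in for a DataFrame; real code would check for
# polars.DataFrame / polars.LazyFrame instead.
class FakeFrame:
    def __init__(self, columns):
        self.columns = columns

df1 = FakeFrame(["id", "size"])
df2 = FakeFrame(["island", "species"])

def all_known_columns(namespace=None):
    """Pool column names from every frame-like object in a namespace."""
    names = set()
    for value in (namespace if namespace is not None else globals()).values():
        if isinstance(value, FakeFrame):
            names.update(value.columns)
    return sorted(names)

print(all_known_columns())  # ['id', 'island', 'size', 'species']
```

An IPython-aware version could feed this pooled list into a custom completer.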
You can even add a `_ipython_key_completions_()` method to get square-bracket completions.
> You can even add a `_ipython_key_completions_()` method to get square-bracket completions.
I took care of that back in November: https://github.com/pola-rs/polars/pull/5477 ;) But, for exactly the reasons you gave, it only covers DataFrame and struct autocomplete.
We might look upwards in locals for LazyFrame/DataFrame types. 🤔
> We might look upwards in locals for LazyFrame/DataFrame types. 🤔
I'd give that approach a qualified "sort of", given that `locals` could contain a lot of dataframes (on the low end maybe just 1-2, but it's not unreasonable to think there may be tens, hundreds, or even thousands in some use-cases), and you wouldn't be able to tie your `col` back to a specific one, meaning -as @aldanor says- you'd have to walk everything in the locals dict to find dataframes and potentially offer all possible cols - or perhaps restrict it to "the most recently instantiated frame", but then frequently be wrong if the expression was being applied to something declared a few lines earlier...

(The frame/locals inspection trick works much better when identifying a specific named object; I pointed @universalmind303 at duckdb's code for doing that with their SQL queries that 'magically' pick up the named dataframe, for example - that works nicely as the locals will only contain one such named object.)
If we were to offer an "official" helper object, we could perhaps make it capable of this (behind an opt-in flag), but I probably wouldn't turn it on by default.
> How would it know where to pick column names from? (given that pl.col is detached from actual dataframes, which is kind of the whole point of it)
That's why I said that it would need to be implemented at the tooling level. It would need some combination of AST parsing (to see that `pl.col` is being used on a dataframe object) and object introspection (of the dataframe object, to know what fields are available). Most tooling right now tends to focus on static analysis or dynamic introspection, unfortunately. There's no reason that a tool couldn't do both, however.

I think that declarative APIs like the one Polars uses are the ideal way to perform data analysis tasks, but the downside is that we can't get autocompletion using standard techniques like pandas can.
I think the ColumnHelper would be super valuable even if it does not yet have autocomplete. There is no autocomplete on `pl.col`, by the way. I agree that the use of `_` is not a good idea, as it means something else that is used a lot. datatable uses the `f` shortcut for "frame proxy", but for our use case I like `c` as a shortcut for "column proxy".
For tooling access, we can add `df.col` and set the `__dir__` magic method on that `col`. The editor can then inspect the `col` object and get the list of names. Of course, with that you don't write `pl.col` but `df.col`:

df = pl.DataFrame({"a": [1, 2, 3]})
df.select(df.col.a)
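That idea can be sketched like this; `FrameColumns` is an invented accessor class, and a string stands in for the expression that `pl.col(name)` would return:

```python
class FrameColumns:
    """Hypothetical `df.col` accessor: it knows the frame's schema, so
    dir() - and therefore REPL/editor completion - lists real columns."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getattr__(self, name):
        # Real code would return pl.col(name).
        return f"col({name})"

    def __dir__(self):
        return list(self._columns)

# A real implementation would hang an instance of this off the DataFrame.
cols = FrameColumns(["a", "b"])
print(cols.a)     # col(a)
print(dir(cols))  # ['a', 'b']
```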
Personally, I just want to be able to type `pl.col.name`, since it is just so convenient, especially with how expressions are written in Polars, where we need to call `col` multiple times.
Also, rather than making a helper, we can just make `col` an object:

import polars as pl

class Column:
    orig_col = pl.col

    def __call__(self, *names):
        return Column.orig_col(*names)

    def __getattr__(self, name):
        return self.__call__(name)

    def __getitem__(self, name):
        return self.__call__(name)

pl.col = Column()
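The same pattern, sketched with a plain function standing in for the original `pl.col` so the three spellings can be compared side by side (the monkey-patching of `pl.col` itself is left out):

```python
# Stand-in for the original pl.col callable.
def orig_col(*names):
    return "col(" + ", ".join(names) + ")"

class Column:
    def __call__(self, *names):
        return orig_col(*names)

    def __getattr__(self, name):
        return self(name)

    def __getitem__(self, name):
        return self(name)

col = Column()
print(col("name"))  # col(name)  -- the classic call
print(col.name)     # col(name)  -- attribute sugar
print(col["name"])  # col(name)  -- item sugar
```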
By the way, `__getitem__` in the `ColumnHelper` above is a bit incorrect (although it still works the same), because `__getitem__` only accepts one argument: `a['b', 'c']` is actually `a[('b', 'c')]`.
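That tuple behaviour is easy to verify with a tiny recorder class:

```python
# Records exactly what Python passes to __getitem__.
class Recorder:
    def __getitem__(self, key):
        return key

r = Recorder()
print(r["b"])       # 'b'         -- a single string
print(r["b", "c"])  # ('b', 'c')  -- ONE tuple argument, not two strings
```

So `c["b", "c"]` hands `pl.col` one tuple; as the comment notes, it happens to work the same, presumably because `pl.col` also accepts a sequence of names.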
I would like to add my vote for `df.col.name`, as that would enable autocompletion in environments like Jupyter, which can greatly ease exploring, debugging, and working with DataFrames.
`pl.col.name` would be better than `df.col.name` in contexts where you don't have a reference to the dataframe, or where the dataframe variable name is long.
At first glance this solution feels a bit weird, but I do think that typing out `pl.col('name')` can be burdensome, and I can't think of anything more succinct and explicit than `pl.col.name`, given the syntax constraints of Python. Alternatively, `pl.c.name` might be nice too, at the cost of some explicitness.
I'm not really a fan of solutions that require instantiating an intermediate column helper object like `ColumnHelper`. There are situations where they work really well, but they also increase mental overhead and maintenance burden enough that sometimes they are not worth it.

`pl.col.name` or `pl.c.name` (the SQLAlchemy way) would work great as syntactic sugar. For autocompletion it is pointless, though, as there is no way the runtime (Jupyter) can query the object for the attributes.
The easiest way to solve the long variable name syndrome is to introduce a local alias, i.e.
df = data_frame_with_long_name
df.c.name
Maybe the best way would be to have both `pl.c` and `df.c`.
The local alias trick won't work in situations where you don't have access to the dataframe, like in

df = pl.scan_parquet(...).(...).select(pl.col('name')).collect()

or

df_joined = df1.join(df2, ...).groupby(...).agg(pl.col('name'))

basically, situations with non-trivial function chaining.
Using a one-character attribute name is not good practice. It is probably OK for a loop variable or a small-scoped temporary variable, but if you are going to use it in a lot of places, better to make it readable.

For context-sensitive completion like that, I think that is a job for the editor/completion server. It can see that `pl` refers to Polars and generate the completion based on the surrounding code. It is not the place of a data processing library to implement complex autocompletion like that.
@char101 I may have not made myself clear: autocompletion in live coding environments like Jupyter works (to my understanding) by actually asking the object (`df.col`, maybe also additionally available as `df.c` for brevity; in SQLAlchemy, where I first saw this pattern, it's actually `Table.columns` and then `Table.c` for brevity where you want it) what its attributes are, and then showing them in the autocomplete. That works because `dir(df.col)` actually returns all the attributes that are available.

To my understanding, this is different from editors relying on language servers and tools like tree-sitter to infer possible attributes from the context of the source code.

This is also what makes live coding environments like Jupyter so damn effective at exploring objects (and datasets): you can so easily look at what attributes/functions they actually expose - no complicated language server/editor setup and no knowledge of the code and frameworks needed. It all happens by talking to the actual object being completed on.
Jupyter is interactive, and it should be easy to assign a variable to the dataframe; implementing `__dir__` on `df.col` is then easy.

My comment is on the example by sslivkoff above, where we create the dataframe and then select from it: `pl.read_csv(...).select(pl.col...)`. Implementing autocompletion on `pl.col` is difficult because it is context-sensitive. To do that, we would need to generate a traceback, find the outer frame, find the dataframe object, and then inspect the columns of the dataframe - and that is for a simple case.
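That outer-frame lookup could be sketched with the `inspect` module; `FakeFrame` is a hypothetical stand-in for a real DataFrame, and this only covers the simple case described above:

```python
import inspect

# Duck-typed stand-in for a DataFrame.
class FakeFrame:
    def __init__(self, columns):
        self.columns = columns

def columns_in_caller_scope():
    """Walk up the call stack and pool columns from any frame-like
    object found in a caller's local variables."""
    names = set()
    for frame_info in inspect.stack()[1:]:
        for value in frame_info.frame.f_locals.values():
            if isinstance(value, FakeFrame):  # real code: polars types
                names.update(value.columns)
    return sorted(names)

def user_code():
    df = FakeFrame(["id", "size"])
    return columns_in_caller_scope()

print(user_code())  # ['id', 'size']
```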
Good point... autocomplete would be difficult to implement in the context of function chaining, and probably has a large number of edge cases (`alias()`, `join()` suffixes, `rename()`, ...).

To your other point, I can also get on board with the idea that one-character names are too short. In the spectrum of

pl.col('name')
pl.col.name
pl.c.name

the difference in ease of typing is much larger between 1 & 2 than between 2 & 3.
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Writing `pl.col('some_column')` is too long to type, especially since working with columns is the main focus for most operations supported in Polars. Would it be possible to treat some special symbol as column names within functions that accept column name specifications?

Reproducible example

Expected behavior

Using `:column` (or some other notation, since `:` is significant in Python syntax) to provide syntax sugar over `pl.col()` would make specifying columns much easier.

Alternatives:
:column - but `:` is significant syntax in Python
$column
@column

Installed versions