pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Aliases for column names #11723

Open bbirand opened 8 years ago

bbirand commented 8 years ago

When I work with Pandas DataFrames, I prefer to keep the full column names for clarity. So when I print out the head, or use describe, I get a meaningful table. However, this also means I have column names like "Time of Sale" that become annoying to type out.

A nice compromise would be to have short "aliases" for column names. For instance, I could define the alias tos for the column above, perhaps like so:

df = pd.read_csv(...)
df.set_alias({'Time of Sale' : 'tos'})

Then, the __getattr__ method can look up aliases in addition to column names, so I can refer to that column simply as df.tos. But for all other purposes, the column's name is still the descriptive full name.

Would this make sense?

jreback commented 8 years ago

related to #10349

I suppose this is possible. This would be fairly easy to implement, but would require a good number of test cases to ensure it's propagating correctly (e.g. this is analogous to the name attribute for Indexes in that it propagates when appropriate).

Further, it would require an audit of the indexing code for it to be a synonymous application (e.g. you can use the alias wherever you could use the actual label).

So while this is interesting, it would require a pull-request from the community to jump start it.

bbirand commented 8 years ago

I'll have a go at this when I get a chance. It also occurred to me that these aliases may be useful when dealing with the DataFrame.query() method. Based on my trials, this function does not work when there are spaces in the column names (please correct me if I'm wrong, I couldn't get them to work).

jreback commented 8 years ago

No, .query parses its expression from a string, so you cannot use column names that contain spaces; this is noted in the documentation.
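Later pandas releases (0.25 and newer) accept backtick-quoted column names inside query expressions, which covers names containing spaces. A minimal sketch, with made-up data and column names chosen only for illustration:

```python
import pandas as pd

# Made-up data; "Time of Sale" is stored here as an hour of the day.
df = pd.DataFrame({"Time of Sale": [9, 14, 17], "Amount": [10.0, 25.5, 7.25]})

# Backticks let the expression refer to a column whose name contains spaces.
afternoon = df.query("`Time of Sale` > 12")
print(afternoon)
```

This does not give the column a short alias; it only makes the full name usable inside the expression.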

shoyer commented 8 years ago

I'm not a big fan of including this feature in pandas itself, because it would make the pandas data model significantly more complex. Maybe this could be implemented in some sort of add-on package that wraps pandas DataFrames? Another option would be a DataFrame subclass.

ijstokes commented 8 years ago

There are certainly risks that could be introduced by adding aliasing, but wouldn't a straightforward strategy be to augment the logic in __getattr__(), which presumably already does some form of this? If an alias dictionary existed on the DataFrame, then an attribute that was not found using "the usual mechanism" but had a key entry in the alias dictionary would be looked up again under the aliased column name. E.g.

# 1. works today:
df['Time of Sale']

# 2. fails today:
df.time_of_sale

# 3. could work in the future:
df.alias = dict(time_of_sale='Time of Sale')
df.time_of_sale

Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.
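The fallback described above can be tried outside pandas with a DataFrame subclass. The following is only a sketch under assumptions: AliasedFrame and its alias attribute are hypothetical names, not part of pandas, and the sketch covers attribute access only, not df[...] indexing or query().

```python
import pandas as pd


class AliasedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass with an alias dictionary (not a pandas API)."""

    # Names listed in _metadata are treated as plain attributes, and pandas
    # tries to carry them over to derived objects in many operations.
    _metadata = ["alias"]

    # Class-level default so that reading self.alias never falls into __getattr__.
    alias = None

    @property
    def _constructor(self):
        # Ask pandas to build results of operations as AliasedFrame too.
        return AliasedFrame

    def __getattr__(self, name):
        try:
            # The usual mechanism: real attributes and identifier-like column labels.
            return super().__getattr__(name)
        except AttributeError:
            # Second attempt: map the alias to the full column label.
            aliases = self.alias or {}
            if name in aliases:
                return self[aliases[name]]
            raise


df = AliasedFrame({"Time of Sale": [9, 14, 17]})
df.alias = dict(time_of_sale="Time of Sale")
print(df.time_of_sale)     # resolved through the alias dictionary
print(df["Time of Sale"])  # the full label keeps working
```

Because the alias is consulted only after the normal lookup fails, an alias can never shadow an existing column or method. That also means an alias that collides with one is silently ignored.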

bbirand commented 8 years ago

Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.

In order to do 2., you would have to rename the column, possibly using http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

And then, when you'd like to print or plot it, you'd rename it back to the original version.
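A minimal sketch of that rename round trip, using made-up data; the short name tos and the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Time of Sale": [9, 14, 17]})

# One mapping for working, and its inverse for presentation.
short_names = {"Time of Sale": "tos"}
long_names = {short: full for full, short in short_names.items()}

work = df.rename(columns=short_names)       # short names while coding
late = work[work.tos > 12]                  # attribute access now works
report = late.rename(columns=long_names)    # full names for printing or plotting
print(report)
```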

I too think this would still be a good addition for interactive work. To make things even more interesting, I would alias "Time of Sale" to "tos", so I can work with the data as df.tos, but then see the full name when plotted.

KeithWM commented 6 years ago

I'd also like to see such a feature. For me the favourite use case would be to have nice, legible axis labels (with units) in seaborn plots. I know one can manually set the axis labels, but I find this error-prone and too verbose, and it leads to code duplication.

If you ask me, the easier way would be to keep the current name in the role of an alias as @bbirand proposes, and to add some other field for a longer name, which can default to the "normal" name if none is explicitly given.

ajeet2808 commented 4 years ago

Any update on this feature?

obarak commented 4 years ago

We need an equivalent for "SELECT max(column1)*0.25+ 0.44*sum(column2) as 'calculated_column' from TABLE group by column3,column5"

@luisfelipe18 - Actually, for aggregation you already have aliasing in Pandas, see here (I'd recommend reading through the entire post).

The current issue refers to aliasing existing columns, regardless of aggregation.
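As a rough illustration of the aggregation aliasing mentioned above, the SQL statement can be approximated with named aggregation (available since pandas 0.25) followed by an ordinary column expression. The data and the intermediate names max_c1 and sum_c2 are made up for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [1.0, 4.0, 2.0, 8.0],
    "column2": [10.0, 20.0, 30.0, 40.0],
    "column3": ["a", "a", "b", "b"],
    "column5": ["x", "x", "y", "y"],
})

# Named aggregation: each keyword becomes the name of an output column.
agg = df.groupby(["column3", "column5"]).agg(
    max_c1=("column1", "max"),
    sum_c2=("column2", "sum"),
)

# Combine the aggregates under the name used in the SQL statement.
agg["calculated_column"] = agg["max_c1"] * 0.25 + 0.44 * agg["sum_c2"]
result = agg[["calculated_column"]].reset_index()
print(result)
```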

TomAugspurger commented 3 years ago

IMO, we shouldn't add this to pandas itself. Indexing is complicated enough without aliases.

We'd be better served by adopting / defining a convention (similar to how xarray uses CF conventions) for mapping column names to descriptive names. These could be stored in the DataFrame.attrs dict which (should) propagate through operations. Then downstream libraries (e.g. plotting libraries, libraries for generating tables for presentation) can use the descriptive names.
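A minimal sketch of what such a convention might look like. The attrs key "labels" and the helper label_for are assumptions made up for this example; pandas defines no such convention, and DataFrame.attrs itself is documented as experimental:

```python
import pandas as pd

df = pd.DataFrame({"tos": [9, 14, 17]})

# "labels" is a made-up key mapping column names to descriptive names.
df.attrs["labels"] = {"tos": "Time of Sale"}


def label_for(frame, column):
    """Hypothetical helper: the descriptive name if one is recorded, else the column name."""
    return frame.attrs.get("labels", {}).get(column, column)


print(label_for(df, "tos"))    # descriptive name from attrs
print(label_for(df, "other"))  # falls back to the column name
```

A plotting or table library that agreed on the same key could call a helper like this when it builds axis labels or headers, and the code would keep using the short column names.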

adavidzh commented 3 years ago

I'd like to echo @KeithWM's point about there being a "long form" for a column's contents, i.e. something that seaborn can use in axis labels. This would not necessarily be an alias, but rather a human-readable form with full description (e.g. involving LaTeX) and units. This sort of thing comes up over and over again in making scientific plots; I found this thread because I want a column named engine_data_total_throughput and would like the axis label to be $\sum$ data throughput [Gb/s] without having to specify it over and over again when plotting.

I understand that this "long form" is not a general scheme for creating aliases (that is a many-to-one correspondence) and it could make sense to understand what is the main use case and, perhaps, have a new thread.

Krzmbrzl commented 2 years ago

If I understood this issue correctly, it is about the intention to retain the (potentially long and verbose) original column names for everything but for accessing the columns in code.

As I see it, we can already get exactly that without any modification to pandas at all: Just define some constants and then use those to access your columns in your code:

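import pandas as pd

class Columns:
    # Short constant names on the left, full column labels on the right.
    colA = "My tediously long name for column A"
    colB = "Yet another long column name"
    # Raw string, so that \e is not read as an (invalid) escape sequence.
    colC = r"Some column with $\emph{special}$ symbols in it"

df = pd.read_csv(...)

# Select a column through its short constant instead of typing the full label.
print(df[Columns.colA])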

Using a separate class to create a namespace for the column constants is of course optional and you can omit it if you prefer.

If I did not miss anything, this seems to fit all scenarios in which one would want to use aliases, unless you are trying to alias some columns to allow something like column-duck-typing. But I guess that would probably only get messy really quickly anyway :thinking: