pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: allow preserving one of the indexes when merging two DataFrames #46882

Open multimeric opened 2 years ago

multimeric commented 2 years ago

Is your feature request related to a problem?

I want to be able to merge two DataFrames, but keep the index of the left one in the final result:

>>> import pandas as pd
>>> import string
>>> df1 = pd.DataFrame({"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5]))
>>> df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})
>>> df1
   a   b
a  0  10
b  1  11
c  2  12
d  3  13
e  4  14
>>> df2
   a  c
0  0  A
1  1  B
2  2  C
3  3  D
4  4  E

The current merge behaviour is to just drop the index entirely:

>>> df1.merge(df2, on="a")
   a   b  c
0  0  10  A
1  1  11  B
2  2  12  C
3  3  13  D
4  4  14  E

Describe the solution you'd like

Add a new parameter, preserve_index, to merge, which takes either "left", "right", or None.

DataFrame.merge(preserve_index="left")

In my above example, this would work like:

>>> df1.merge(df2, on="a", preserve_index="left")
   a   b  c
a  0  10  A
b  1  11  B
c  2  12  C
d  3  13  D
e  4  14  E
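
Until such a parameter exists, the requested behaviour can be approximated with a small wrapper. The following is only a sketch under stated assumptions: `merge_preserving_index` and the temporary column name `__preserved_index__` are hypothetical names made up for illustration and are not part of pandas, and only single-level indexes are handled.

```python
import string

import pandas as pd


def merge_preserving_index(left, right, preserve_index=None, **merge_kwargs):
    """Hypothetical helper emulating the proposed ``preserve_index`` option.

    This is not a pandas API; it only wraps the existing ``DataFrame.merge``.
    """
    if preserve_index is None:
        # No preservation requested: behave exactly like a plain merge.
        return left.merge(right, **merge_kwargs)
    if preserve_index not in ("left", "right"):
        raise ValueError('preserve_index must be "left", "right", or None')

    source = left if preserve_index == "left" else right
    if source.index.nlevels != 1:
        raise NotImplementedError("this sketch only handles single-level indexes")

    # Carry the index through the merge as an ordinary (temporary) column.
    tmp = "__preserved_index__"  # assumed not to clash with real column names
    tagged = source.assign(**{tmp: source.index})

    if preserve_index == "left":
        merged = tagged.merge(right, **merge_kwargs)
    else:
        merged = left.merge(tagged, **merge_kwargs)

    # Restore the carried labels as the index. For outer-style joins,
    # rows without a match on the preserved side get a missing label.
    merged = merged.set_index(tmp)
    merged.index.name = source.index.name
    return merged


df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5])
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

left_kept = merge_preserving_index(df1, df2, preserve_index="left", on="a")
print(list(left_kept.index))  # ['a', 'b', 'c', 'd', 'e']
```

With `preserve_index=None` the helper falls through to a plain `merge`, which mirrors the claim that omitting the parameter leaves existing behaviour unchanged.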

API breaking implications

None. This is a new parameter, and if it is not provided the API is identical.

Describe alternatives you've considered

It is already possible to work around this by resetting the index and then setting it as the index again, as described here, but this is:

Probably less efficient
More verbose
Not intuitive or clear to users

attack68 commented 2 years ago

Isn't it just as easy to use df1.merge(df2, on="a").set_index("a")? Otherwise we risk introducing features that need to be maintained and tested with further developments when these methods already exist.

Edit: Now I see the end of your post. OK, but I'm -1 on this.

multimeric commented 2 years ago

You also have to reset the index to ensure it's a column, and I think the three points above show enough merit to make this worthwhile. Replacing a chain of three methods with one method and one parameter is a big improvement.
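
To make the difference between the two chains concrete, here is a minimal sketch using the df1 and df2 from the original post. The variable names `short` and `full` are chosen only for this illustration.

```python
import string

import pandas as pd

df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5])
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

# Two-step chain: the index becomes the values of column "a" (0..4),
# so df1's original labels are lost.
short = df1.merge(df2, on="a").set_index("a")

# Three-step chain: reset_index() first turns the labels into a column
# (named "index" because df1's index is unnamed), so they survive the merge.
full = df1.reset_index().merge(df2, on="a").set_index("index")

print(list(short.index))  # [0, 1, 2, 3, 4]
print(list(full.index))   # ['a', 'b', 'c', 'd', 'e']
```

Note that the three-step result carries the index name "index", which would need to be cleared separately to match df1 exactly.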

Mehgarg commented 2 years ago

take

attack68 commented 2 years ago

@multimeric it's fair to give a full response on this, since you raise sensible points.

The pandas API is large (too large). My general approach is to not add any args / methods that perform functions that can already be performed. In fact I am in favour of selectively removing / reducing args when multiple ways of performing tasks exist. And my PRs reflect this philosophy.

Probably less efficient

In the long run this has the advantage of making code more maintainable for developers, and likely improves performance, since those core methods can be optimised for general tasks as opposed to optimising selective and individual cases, or specific ways to handle args. This is important for the longevity and future development of pandas.

More verbose

This is subjective. Personally I strive for an atomised code construction. In software development I prefer using core methods rather than subtle args to avoid the operational risk of arg deprecation. merge and set_index are core methods so are unlikely to be restructured, so I would favour chaining these, especially where merge is such a complex method in terms of combinatorial challenges.
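
One example of the combinatorial cases merge has to handle is a one-to-many match: with the chained workaround, a left row that matches several right rows appears several times, so its index label is repeated. A small sketch, where the frames `left` and `right` are invented for this illustration and are not from the thread:

```python
import pandas as pd

left = pd.DataFrame({"a": [0, 1], "b": [10, 11]}, index=["x", "y"])
# Key 0 appears twice on the right, so left row "x" matches two rows.
right = pd.DataFrame({"a": [0, 0, 1], "c": ["A", "A2", "B"]})

merged = left.reset_index().merge(right, on="a").set_index("index")

print(list(merged.index))      # ['x', 'x', 'y']
print(merged.index.is_unique)  # False
```

Any way of preserving an index through a merge, whether chained or built in, has to accept that the resulting index may no longer be unique.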

Not intuitive or clear to users

Fully agree. I think use cases like this are valuable additions to the documentation and cookbooks, and we should work to provide better examples that users can copy, in the knowledge that the pandas team offers confidence that it is the "most efficient" way. This is a development item and something we need to do better.

Sorry I don't support your idea, hope you appreciate my feedback.