[ENH] Multi-dataframe simultaneous merging

pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor

https://pyjanitor-devs.github.io/pyjanitor

MIT License

1.35k stars 169 forks source link

[ENH] Multi-dataframe simultaneous merging #391

Open zbarry opened 5 years ago

zbarry commented 5 years ago

See: https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns

Most of the methods proposed are something like a .merge().merge().merge chaining. This seems like a pretty computationally suboptimal way of doing things (as far as I can tell). If you scroll down to the "This is an ideal situation for the join method" solution, that seems like a much better way to go. Maybe there could be a pyjanitor merge_multiple function that emulates the high-level API of merge using join under the hood.

szuckerman commented 4 years ago

Found this after going through some old issues ;)

I actually do this often and the most clean solution I’ve found is to have an iterator of DataFrames and use itertools.reduce to reduce using either the pandas join or merge functionality.

ericmjl commented 4 years ago

@szuckerman do you think you could paste here a strawman sketch of how you've done it? Perhaps it could become just the solution!

zbarry commented 4 years ago

The reduce route would be a variant on .merge().merge(), I think. The problem that I didn’t explain very well in hindsight is that this is very inefficient for large lists of dataframes - you create a lot of unnecessary intermediates this way:

merge A, B -> AB
merge AB, C -> ABC
merge ABC, D -> ABCD
...

Imagine A,B,C,D were all large dataframes, and you can see why this would be a slow approach as many temporary dataframes are allocated and the same data is copied over and over again. Instead, with the .join() method, you can join all of them simultaneously.

ericmjl commented 4 years ago

True that. MultiIndex joins are possible too, right? I ask because merge allows merging on common columns, whereas I think join assumes that the index must be identical.

Maybe the idea needs to be fleshed out a bit more. I have done this before where:

df1 and df2 need to be merged based on columns A, C, D, giving df12,
and then df12 needs to be merged with df3 on the basis of columns B, C, D, E.

(Don't ask - the definition of "a sample" was convoluted in that project...)

merge gives a very natural route, though I have yet to try this with joins, so maybe with some cleverness, it'll work too?

samukweku commented 4 years ago

Looking through existing issues to see if I can knock them off. This looks appealing and a little mind bending. So, joining multiple dataframes on possibly different column names? I could push a PR on this and see what the team thinks?

samukweku commented 4 years ago

Unassigned myself, so I can get feedback from others, on what the API should look like. Decided not to assume, as the original creators of this issue might have some ideas that I am not privy to.

ericmjl commented 4 years ago

Going to nerd snipe @zbarry on this one. I think I'd like to hear what the scope of the function could look like.