Open zbarry opened 5 years ago
Found this after going through some old issues ;)
I actually do this often and the most clean solution I’ve found is to have an iterator of DataFrames and use itertools.reduce to reduce using either the pandas join or merge functionality.
@szuckerman do you think you could paste here a strawman sketch of how you've done it? Perhaps it could become just the solution!
The reduce
route would be a variant on .merge().merge()
, I think. The problem that I didn’t explain very well in hindsight is that this is very inefficient for large lists of dataframes - you create a lot of unnecessary intermediates this way:
merge A, B -> AB
merge AB, C -> ABC
merge ABC, D -> ABCD
...
Imagine A,B,C,D were all large dataframes, and you can see why this would be a slow approach as many temporary dataframes are allocated and the same data is copied over and over again. Instead, with the .join()
method, you can join all of them simultaneously.
True that. MultiIndex joins are possible too, right? I ask because merge allows merging on common columns, whereas I think join assumes that the index must be identical.
Maybe the idea needs to be fleshed out a bit more. I have done this before where:
df1
and df2
need to be merged based on columns A, C, D
, giving df12
,df12
needs to be merged with df3
on the basis of columns B, C, D, E
.(Don't ask - the definition of "a sample" was convoluted in that project...)
merge
gives a very natural route, though I have yet to try this with join
s, so maybe with some cleverness, it'll work too?
Looking through existing issues to see if I can knock them off. This looks appealing and a little mind bending. So, joining multiple dataframes on possibly different column names? I could push a PR on this and see what the team thinks?
Unassigned myself, so I can get feedback from others, on what the API should look like. Decided not to assume, as the original creators of this issue might have some ideas that I am not privy to.
Going to nerd snipe @zbarry on this one. I think I'd like to hear what the scope of the function could look like.
See: https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns
Most of the methods proposed are something like a
.merge().merge().merge
chaining. This seems like a pretty computationally suboptimal way of doing things (as far as I can tell). If you scroll down to the "This is an ideal situation for the join method" solution, that seems like a much better way to go. Maybe there could be a pyjanitormerge_multiple
function that emulates the high-level API ofmerge
usingjoin
under the hood.