pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.74k stars 17.95k forks source link

Very unintuitive to `join` two dataframes with indexes that overlap partially #10621

Closed dlenski closed 9 years ago

dlenski commented 9 years ago

I posted a slightly-ranting question about this on StackOverflow.

Basically, I want to use pandas to join two dataframes wherein the index levels of one are a subset of the index levels of the other. For example, the two DFs below have index.names of ('fruit', 'citrus') and ('citrus',) respectively:

fruit_max_by_date =

                    0
    fruit   citrus   
    apples  False   6
    bananas False   5
    oranges True    8
    lemons  True    9

    [4 rows x 1 columns]

citrus_max_by_date =

            1
    citrus   
    False   6
    True    9

    [2 rows x 1 columns]

The intuitive approach to joining them fails, because the indexes do not exactly match:

fruit_max_by_date.join(citrus_max_by_date) =>

                    0   1
    fruit   citrus       
    apples  False   6 NaN
    bananas False   5 NaN
    oranges True    8 NaN
    lemons  True    9 NaN

    [4 rows x 2 columns]

I found a variety of workarounds, none of which I consider very satisfactory. Eventually I found one that doesn't involve giving up on using an index altogether. It involves rebroadcasting the "smaller" index to match the "larger" index:

fruit_max_by_date.join( 
  citrus_max_by_date.reindex(fruit_max_by_date.index, level='citrus') ) =>

                    0  1
    fruit   citrus      
    apples  False   6  6
    bananas False   5  6
    oranges True    8  9
    lemons  True    9  9

    [4 rows x 2 columns]

Is there any reason not to make this behavior automatic?

TomAugspurger commented 9 years ago

Do you want this?

In [8]: pd.merge(fruit_max_by_date, citrus_max_by_date, left_index=True, right_index=True)
Out[8]:
                0  1
fruit   citrus
apples  False   6  6
bananas False   5  6
oranges True    8  9
lemons  True    9  9

I typically use merge instead of join fwiw.

TomAugspurger commented 9 years ago

Actually, join gives the same result for me.

In [11]: fruit_max_by_date.join(citrus_max_by_date)
Out[11]:
                0  1
fruit   citrus
apples  False   6  6
bananas False   5  6
oranges True    8  9
lemons  True    9  9

What version of pandas are you using?

dlenski commented 9 years ago

Drat. I'm using v0.13. Looks like this was partially fixed in 0.14.

Joining a single index with a multi-index now "just works": http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-join-on-mi

However, joining two multi-indexes still doesn't work according to the latest docs, which recommend removing the index as a workaround: http://pandas.pydata.org/pandas-docs/stable/merging.html#joining-with-two-multi-indexes

jorisvandenbossche commented 9 years ago

As a reference, the issue for multi-index on multi-index join is here: #6360

Apparently there were some attempts but never made it in. Feel free to look at it!