pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.47k stars 17.87k forks source link

BUG: Incorrect outer join when merging multi-index with a single-index #20042

Open kylebarron opened 6 years ago

kylebarron commented 6 years ago

Code Sample

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame([['a', 'x', 0.123], ['a','x', 0.234],
   ...:                     ['a', 'y', 0.451], ['b', 'x', 0.453]],
   ...:                    columns=['first', 'second', 'value1']
   ...:                    ).set_index(['first', 'second'])
   ...:                    

In [3]: df2 = pd.DataFrame([['a', 10],['b', 20]],
   ...:                    columns=['first', 'value']).set_index(['first'])
   ...: 

In [4]: df3 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]],
   ...:                    columns=['first', 'final_val']).set_index(['first'])
   ...: 

In [5]: df1
Out[5]: 
              value1
first second        
a     x        0.123
      x        0.234
      y        0.451
b     x        0.453

In [6]: df2
Out[6]: 
       value
first       
a         10
b         20

In [7]: df3
Out[7]: 
       final_val
first           
a              1
b              2
c              3

In [8]: df1.join(df3, how='outer')
Out[8]: 
              value1  final_val
first second                   
a     x        0.123          1
      x        0.234          1
      y        0.451          1
b     x        0.453          2

In [9]: df2.join(df3, how='outer')
Out[9]: 
       value  final_val
first                  
a       10.0          1
b       20.0          2
c        NaN          3

Problem description

I expect the result of an outer join to have all values of the level of the index on which the join is being performed, similar to how the single-index to single-index outer join includes c in the index of the result.

Expected Output

I expect the result from df1.join(df3, how='outer') to be:

              value1  final_val
first second                   
a     x        0.123          1
      x        0.234          1
      y        0.451          1
b     x        0.453          2
c     NaN      NaN            3

or something similar (i.e. maybe '' instead of NaN for second, since it's not numeric).

Output of pd.show_versions()

In [10]: pd.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Linux OS-release: 4.13.0-36-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.22.0 pytest: 3.4.0 pip: 9.0.1 setuptools: 38.5.1 Cython: 0.27.3 numpy: 1.14.1 scipy: 1.0.0 pyarrow: 0.8.0 xarray: None IPython: 6.2.1 sphinx: 1.7.1 patsy: 0.5.0 dateutil: 2.6.1 pytz: 2018.3 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.4 feather: 0.4.0 matplotlib: 2.1.2 openpyxl: 2.5.0 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.2 lxml: 4.1.1 bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.3 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: 0.1.4 pandas_gbq: None pandas_datareader: None
gfyoung commented 6 years ago

Hmm...that indeed looks weird. Investigation and PR to patch are welcome!