pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.56k stars 17.9k forks source link

Joining on a column, not index, gives erroneous result #13799

Open Dmitrii-I opened 8 years ago

Dmitrii-I commented 8 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
d1 = pd.DataFrame({'a': [1, 2], 'b': [11, 22]})
d2 = pd.DataFrame({'a': [1, 2], 'c': ['aa', 'bb']})
d1.join(d2, on=['a'], how='outer', lsuffix='.d1', rsuffix='.d2')

Actual output

     a  a.d1     b  a.d2    c
0  1.0   1.0  11.0   2.0   bb
1  2.0   2.0  22.0   NaN  NaN
1  0.0   NaN   NaN   1.0   aa

Expected Output

   a   b   c
0  1  11  aa
1  2  22  bb

output of pd.show_versions()

INSTALLED VERSIONS

commit: None python: 3.4.3.final.0 python-bits: 64 OS: Linux OS-release: 3.19.0-65-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8

pandas: 0.18.1 nose: None pip: 1.5.4 setuptools: 3.3 Cython: None numpy: 1.10.1 scipy: 0.17.1 statsmodels: 0.6.1 xarray: None IPython: 4.0.0 sphinx: None patsy: 0.4.1 dateutil: 2.5.2 pytz: 2016.3 blosc: None bottleneck: None tables: 3.2.2 numexpr: 2.4.6 matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.999 httplib2: 0.9.2 apiclient: None sqlalchemy: None pymysql: None psycopg2: 2.6.1 (dt dec pq3 ext lo64) jinja2: 2.8 boto: None pandas_datareader: None

sinhrks commented 8 years ago

yeah looks strange. merge should work though...

pd.merge(d1, d2, on='a', how='outer')
#    a   b   c
# 0  1  11  aa
# 1  2  22  bb
jorisvandenbossche commented 8 years ago

There is also a problem with the overlapping columns detection:

In [11]: d1.join(d2, on=['a'])
...
ValueError: columns overlap but no suffix specified: Index([u'a'], dtype='object')

as this should of course not be raised if this is the column you want to merge on.

ivannz commented 8 years ago

how="inner" also gives unexpected results:

d1.join(d1, on=['a'], how='inner', lsuffix='.d1', rsuffix='.d2')
#    a  a.d1  b.d1  a.d2  b.d2
# 0  1     1    11     2    22

d2.join(d1, on=['a'], how='inner', lsuffix='.d1', rsuffix='.d2')
#    a  a.d1   c  a.d2   b
# 0  1     1  aa     2  22
jreback commented 8 years ago

you just need a simple concat or merge here.

In [4]: pd.concat([d1,d2],axis=1)
Out[4]: 
   a   b  a   c
0  1  11  1  aa
1  2  22  2  bb

In [5]: pd.merge(d1,d2, on=['a'], how='outer')
   ...: 
Out[5]: 
   a   b   c
0  1  11  aa
1  2  22  bb

Here is what on does.

on : column name, tuple/list of column names, or array-like
    Column(s) in the caller to join on the index in other,
    otherwise joins index-on-index. If multiples
    columns given, the passed DataFrame must have a MultiIndex. Can
    pass an array as the join key if not already contained in the
    calling DataFrame. Like an Excel VLOOKUP operation

So this has to do with the overlapping detection. I think this should just raise as its fundamentally a user error here. This is not a join/merge rather a concat; here the difference matters.