pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.57k stars 17.9k forks source link

merge_ordered() should accept left_index and right_index args #27760

Open austin3dickey opened 5 years ago

austin3dickey commented 5 years ago

Code Sample, a copy-pastable example if possible

INDEX_COLS = ['group', 'timestamp']

left = pandas.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'timestamp': [0, 1, 2, 5],
    'obs1': [10, 11, 12, 15]
}).set_index(INDEX_COLS)

right = pandas.DataFrame({
    'group': ['a', 'a', 'b', 'b'], 
    'timestamp': [0, 1, 3, 4],
    'obs2': [20, 21, 23, 24]
}).set_index(INDEX_COLS)

pandas.merge_ordered(left, right)

# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "python3.6/site-packages/pandas/core/reshape/merge.py", line 226, in merge_ordered
#    result = _merger(left, right)
#  File "python3.6/site-packages/pandas/core/reshape/merge.py", line 212, in _merger
#    how=how)
#  File "python3.6/site-packages/pandas/core/reshape/merge.py", line 1252, in __init__
#    sort=True  # factorize sorts
#  File "python3.6/site-packages/pandas/core/reshape/merge.py", line 524, in __init__
#    self._validate_specification()
#  File "python3.6/site-packages/pandas/core/reshape/merge.py", line 1033, in _validate_specification
#    lidx=self.left_index, ridx=self.right_index))
# pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Problem description

The error message above implies that I should be able to merge these two timeseries DataFrames using left_index=True and right_index=True. But these are not arguments to the merge_ordered() function like they are to other merging functions.

I could work around this by doing the following:

pandas.merge_ordered(left, right, left_on=INDEX_COLS, right_on=INDEX_COLS).set_index(INDEX_COLS)

but that seems unnecessarily verbose, and possibly less performant than if I could do

pandas.merge_ordered(left, right, left_index=True, right_index=True)

Proposal

It looks like merge_ordered() uses _OrderedMerge under the hood which can take the left_index and right_index arguments (and the code example above works as expected if those arguments are passed through). Would you consider adding these arguments? I can create a PR if there's appetite.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.8.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.24.2 pytest: 4.2.0 pip: 19.0.1 setuptools: 40.7.3 Cython: None numpy: 1.16.2 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.2.0 sphinx: 1.8.4 patsy: None dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
milind commented 1 year ago

Hey - I know that I'm resurrecting a pretty old issue here, but I was wondering whether this is something that might be addressed at some point? I had originally assumed that the default interaction with pd.merge would be the same as pd.merge_ordered but I'm finding that index merging support doesn't seem to be in parity with the more widely-used counterpart. Very easy to work around so it's not pressing, but figured that I would bump this issue to resurface it.

DrNickClarke commented 8 months ago

I would find this useful too. One of the issues I am having is that many of the ways to do this kind of thing create intermediate states with missing values (NaN) which results in int columns being converted to float as part of the process. I am currently using concat with sort_index, which is very inefficient if both the input dataframes are sorted. This look like it could work for my use case if it could use the index directly.