pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Support for fill/nearest indexers for non-unique indexes #51646

Open mx781 opened 1 year ago

mx781 commented 1 year ago

Feature Type

Problem Description

I'm trying to get the position of a requested label, or of the previous label if the requested one is absent, in a DataFrame whose index is monotonically increasing. If the index is also unique, this works fine:

import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=[1,2,3,4,5])
df.index.get_indexer([6], method="ffill")
# >>> array([4])

However, if the index is not unique, InvalidIndexError is raised:

df = pd.DataFrame({"a": range(6)}, index=[1,2,3,3,4,5])
df.index.get_indexer([6], method="ffill")
# >>> InvalidIndexError: Reindexing only valid with uniquely valued Index objects

The same occurs with all other methods, even the default method=None (which I thought should work, since df.index.get_loc(3) works fine and returns a slice).
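To make the contrast concrete, here is a minimal sketch of the three calls on the non-unique index from the example above. It only uses the index values from this report; the variable names are illustrative.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

index = pd.Index([1, 2, 3, 3, 4, 5])

# get_loc on a duplicated label in a monotonic index returns a slice.
loc = index.get_loc(3)
print(loc)  # slice(2, 4, None)

# get_indexer rejects a non-unique index even with the default method=None.
try:
    index.get_indexer([3])
    raised = False
except InvalidIndexError:
    raised = True
print(raised)  # True

# get_indexer_non_unique handles duplicates but takes no fill method.
# It returns the matched positions plus the target positions that were missing.
positions, missing = index.get_indexer_non_unique([3, 6])
print(positions.tolist(), missing.tolist())  # [2, 3, -1] [1]
```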

Feature Description

This limitation doesn't seem to be documented anywhere, so I'm unsure whether this is a missing feature, a bug, or an error on my part. If it is a missing feature (and I'm sure implementing it is no small effort), would you accept a PR? The desired behavior would be to simply return the previous/next/nearest slice.

Alternative Solutions

I didn't see any other functions in the API that would work around this - but perhaps there's an approach here that I'm missing?
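One possible approach for the forward-fill case, assuming the index is monotonically increasing, is to go through Index.searchsorted. The sketch below is not an existing pandas API: ffill_slice is a hypothetical helper name, and it always returns a slice, even for labels that occur only once.

```python
import pandas as pd


def ffill_slice(index: pd.Index, key) -> slice:
    """Hypothetical helper: positions of the last label <= key, as a slice."""
    if not index.is_monotonic_increasing:
        raise ValueError("index must be monotonically increasing")
    # One past the last position whose label is <= key.
    stop = index.searchsorted(key, side="right")
    if stop == 0:
        # Every label is greater than key, so there is nothing to fill from.
        raise KeyError(key)
    label = index[stop - 1]
    # First position holding that same label.
    start = index.searchsorted(label, side="left")
    return slice(int(start), int(stop))


index = pd.Index([1, 2, 3, 3, 4, 5])
print(ffill_slice(index, 6))    # slice(5, 6, None)
print(ffill_slice(index, 3.5))  # slice(2, 4, None)
```

A backward-fill version would mirror this with side="left" for the first search; a nearest variant would need to compare the distances to both neighbours.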

Additional Context

Thanks for all the hard work on Pandas!

mx781 commented 1 year ago

After digging around in the source, I realize that get_indexer is inherently meant for unique indexes only, and get_indexer_non_unique doesn't support non-default methods, so I take it this is simply not supported. Would anyone care to weigh in on how complex an addition this might be, or whether there are blockers to doing it in the first place?

In the meantime, here's a naive, slow, untested workaround for anyone who stumbles upon this use case:

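def get_non_unique_fill_indexer(index, key, method="ffill", tolerance=None):
    """Return index.get_loc() for the label that `key` fills to.

    Assumes `index` is monotonically increasing.
    """
    if method not in {"ffill", "bfill"}:
        raise ValueError(f"unsupported method: {method!r}")
    duplicates = index.duplicated()
    index_deduplicated = index[~duplicates]
    # Position of the filled match within the de-duplicated index.
    dedup_indexer = index_deduplicated.get_indexer([key], method=method, tolerance=tolerance).item()
    if dedup_indexer == -1:
        raise KeyError(key)
    # Shift by the number of repeated entries that sort before `key`.
    num_duplicates_before = len(index[(index < key) & duplicates])
    matched_label = index[num_duplicates_before + dedup_indexer]
    return index.get_loc(matched_label)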
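A more compact variant of the same idea, sketched under the same assumption of a monotonically increasing index: resolve the key against the unique labels, then hand the matched label to get_loc. The name fill_loc is hypothetical, and like the workaround above this has not been tested beyond the small cases shown.

```python
import pandas as pd


def fill_loc(index: pd.Index, key, method="ffill"):
    """Hypothetical helper: get_loc() of the label that `key` fills to."""
    # unique() keeps first-seen order, so the result stays monotonic.
    unique_labels = index.unique()
    pos = unique_labels.get_indexer([key], method=method).item()
    if pos == -1:
        raise KeyError(key)
    # get_loc returns an int for a unique label, a slice for a repeated one.
    return index.get_loc(unique_labels[pos])


index = pd.Index([1, 2, 3, 3, 4, 5])
print(fill_loc(index, 6))           # 5
print(fill_loc(index, 3))           # slice(2, 4, None)
print(fill_loc(index, 0, "bfill"))  # 0
```

Because the match is looked up by label, no duplicate-count offset is needed; the cost is an extra unique() pass on every call.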