ENH: Support for fill/nearest indexers for non-unique indexes

pandas-dev / pandas



Feature Type

[X] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I'm trying to look up the position of a requested label, or of the previous label when the requested one is absent, in a DataFrame whose index is monotonically increasing. If the index is also unique, this works fine:

import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=[1,2,3,4,5])
df.index.get_indexer([6], method="ffill")
# >>> array([4])

However if the index is not unique, InvalidIndexError is raised:

df = pd.DataFrame({"a": range(6)}, index=[1,2,3,3,4,5])
df.index.get_indexer([6], method="ffill")
# >>> InvalidIndexError: Reindexing only valid with uniquely valued Index objects

The same error is raised for every other method, including method=None (which I expected to work, since df.index.get_loc(3) succeeds and returns a slice).
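For reference, here is a minimal sketch of the asymmetry described above, run against a bare Index rather than the DataFrame. It assumes a reasonably recent pandas in which InvalidIndexError is importable from pandas.errors. Index.get_indexer_non_unique is an existing method that covers exact matches only; it takes no method argument, so it does not solve the fill case.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

index = pd.Index([1, 2, 3, 3, 4, 5])

# get_loc accepts a repeated label and, for a monotonic index, returns a slice.
loc = index.get_loc(3)  # slice(2, 4, None)

# get_indexer_non_unique handles exact matches on a non-unique index:
# it returns every matching position, plus the target positions that were missing.
indexer, missing = index.get_indexer_non_unique([3, 6])
# indexer -> [2, 3, -1], missing -> [1]

# get_indexer rejects the non-unique index even without a fill method.
try:
    index.get_indexer([3])
    raised = False
except InvalidIndexError:
    raised = True
```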

Feature Description

This limitation doesn't seem to be documented anywhere, so I'm unsure whether this is a missing feature, a bug, or an error on my part. If it is a missing feature, I'm sure it would be no small effort, but would you accept a PR? The desired behavior would be to return the slice for the previous, next, or nearest label.

Alternative Solutions

I didn't see any other functions in the API that would work around this, but perhaps there's an approach that I'm missing?
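One possible workaround, sketched here under the assumption that the index is monotonically increasing, is to use Index.searchsorted to find the requested-or-previous label and its full run of duplicates. The helper name ffill_slice is hypothetical and not part of pandas. It handles only the forward-fill case and ignores tolerance.

```python
import pandas as pd


def ffill_slice(index: pd.Index, key) -> slice:
    """Hypothetical helper: slice covering the last label <= key.

    Assumes a monotonically increasing index; duplicates are allowed.
    """
    if not index.is_monotonic_increasing:
        raise ValueError("index must be monotonically increasing")
    # Position just past the last entry that is <= key.
    stop = int(index.searchsorted(key, side="right"))
    if stop == 0:
        # Every label is greater than key, so there is nothing to fill from.
        raise KeyError(key)
    # First position of the matched label, so the slice spans all its duplicates.
    label = index[stop - 1]
    start = int(index.searchsorted(label, side="left"))
    return slice(start, stop)


index = pd.Index([1, 2, 3, 3, 4, 5])
exact = ffill_slice(index, 3)    # label 3 occupies positions 2 and 3
filled = ffill_slice(index, 6)   # falls back to label 5 at position 5
```

Unlike get_loc, this sketch always returns a slice, even for a label that occurs once, which keeps the return type uniform for use with df.iloc.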

Additional Context

Thanks for all the hard work on Pandas!

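def get_non_unique_fill_indexer(index, key, method="ffill", tolerance=None):
    # Assumes a monotonically increasing index that may contain duplicates.
    assert method in {"ffill", "bfill"}
    # Look the key up in a de-duplicated copy, where get_indexer is allowed.
    duplicates = index.duplicated()
    index_deduplicated = index[~duplicates]
    dedup_indexer = index_deduplicated.get_indexer(
        [key], method=method, tolerance=tolerance
    ).item()
    if dedup_indexer == -1:
        raise KeyError(key)
    # Shift by the number of duplicates that sort before the key, so the
    # position lands on an occurrence of the matched label in the original index.
    num_duplicates_before = len(index[(index < key) & duplicates])
    indexer_end = index[num_duplicates_before + dedup_indexer]
    # get_loc returns an int for a unique label and a slice for a repeated one.
    indexer = index.get_loc(indexer_end)
    return indexer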
