grantjenks / python-sortedcontainers

Python Sorted Container Types: Sorted List, Sorted Dict, and Sorted Set
http://www.grantjenks.com/docs/sortedcontainers/
Other
3.51k stars 203 forks source link

Feature request: SortedDict Range Deletion #180

Open ThomasShaw1 opened 2 years ago

ThomasShaw1 commented 2 years ago

Range key deletion feature is currently missing for SortedDict. If I want to delete multiple keys within an range, I have to do them one by one using a for loop which is very inefficient in terms of time complexity. See my example below.

# program that removes the subdict from 2 to 4
d = SortedDict({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
lower_bound_index = d.bisect_left(2)
upper_bound_index = d.bisect_left(4)
for _ in range(lower_bound_index, upper_bound_index + 1):
   d.popitem(lower_bound_index)

This is very inefficient both in terms of usage and in terms of time complexity. Time complexity for this one is O(k log n) where k is the range size. Other languages such as C++ and Java only needs an time complexity of O(k + log n) to accomplish the job with the API mentioned below.

In C++, you have the option of calling erase(iterator first, iterator last) for their map data structure. Similarly, in Java, you can call subMap(first, last).clear() for their TreeMap data structure. Something similar should be available for SortedDict in my opinion.

ThomasShaw1 commented 2 years ago

Here's another hacky way of doing it. Not sure if it is better in terms of performance. But still, an API would be nice.

def play():
    sd = SortedDict({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
    low = sd.bisect_left(2)
    high = sd.bisect_left(4)
    for key in sd._list[low:high + 1]:
        super(SortedDict, sd).pop(key)
    del sd._list[low:high + 1]
    print(sd)
grantjenks commented 2 years ago

Having a kind of range_delete makes sense to me. There’s a variety of optimizations that could be done inside the data structure too.

Note that today you can use irange instead of bisect to avoid using the position index.

grantjenks commented 2 years ago

Here's the best I can come up with today (without internal optimizations):

$ python -m IPython
Python 3.9.1 (v3.9.1:1e5d33e9b9, Dec  7 2020, 12:10:52) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import sortedcontainers

In [2]: class SortedDict(sortedcontainers.SortedDict):
   ...:     def range_del(self, minimum=None, maximum=None, inclusive=(True, True)):
   ...:         iterator = self.irange(minimum, maximum, inclusive)
   ...:         keys = list(iterator)
   ...:         for key in keys:
   ...:             del self[key]
   ...: 

In [3]: sd = SortedDict({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

In [4]: sd
Out[4]: SortedDict({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

In [5]: sd.range_del('b', 'd')

In [6]: sd
Out[6]: SortedDict({'a': 1, 'e': 5})
ThomasShaw1 commented 2 years ago

This is great in terms of simplicity. But I would still urge you to take a look at the hack implementation I wrote above. The main idea is that slice deletion for SortedList is already internally optimized to O(log n), which is used for keys, so why not just use it as it is and then take care of the dictionary deletion one by one. That way, the time complexity is O(k + log n) instead of O(k log n), which is inline with Sorted Maps in Java and C++ and other languages.

grantjenks commented 2 years ago

The performance of the two will differ depending on the size of the dictionary and how much is deleted. But I generally agree that an optimized version would be a good addition to the API.

The best improvement to your version would be to avoid building the positional index. If you like at the implementation of SortedList.irange, you can see how to do so.

ThomasShaw1 commented 2 years ago

But how would I achieve slice deletion if I don't have the positional indices? SortedList.__delitem__ only takes index or slice. It is only through slice deletion that we achieve internal optimization.

MacherelR commented 7 months ago

Hello, I'm currently facing the exact same problem and I was wondering whether any of you (@ThomasShaw1) had found some optimized solution ? I've tried the range_del @grantjenks wrote but it doesn't seem really optimized as it doesn't accelerate the deletion.

Best regards