Closed daf46b87-a9e9-4381-bf23-8c373c9135e6 closed 4 years ago
It'd be helpful of the functions in the "bisect" modules will have a "key" argument just like "sort".
This request has come up repeatedly (and been rejected) in the past. See issues 2954, 3374, 1185383, 1462228, 1451588, 1619060.
Could you perhaps explain your particular use case for this? A few truly convincing use-cases might increase the chances of this getting accepted.
Miki, the issue is that bisect calls tend to be made repeatedly, so the key function can be called over and over again for the same argument. It is almost always a better design to simply decorate the list so the key function never gets called more than once per element in the list. If we added key= to bisect, it would encourage bad design and steer people after from better solutions.
What about cases where performance is unimportant, or where the key function is fast (e.g. an attribute access)? Then something like
bisect(a, x, key=attrgetter('size'))
is easy to write and read. Mightn't this be considered good design, from some perspectives?
Another thought: if your list is a list of user-defined objects then a natural way to do the 'decorate' step of DSU might be to add a 'key' attribute to each object, rather than the usual method of constructing pairs. (This has the advantage that you might not have to bother with the 'undecorate' step.) With a key argument, bisect could make use of this technique too.
Disclaimer: I haven't personally had any need for a key argument on bisect, so all this is hypothetical. That's why I'm asking for real use- cases.
I had said "almost always". Sure, if you don't care about performance or scalability, a key= argument would be a net win.
We're responsible for creating an API that steers most programmers in the right direction (Tim sez "we read Knuth so you don't have to"). Algorithmically, the bisect functions are at the wrong level of granularity for applying a key function.
For user-defined objects, there is no need for a key-attribute since can just supply a custom comparison method:
class UserDefined:
. . .
def cmp(self, other):
return cmp(self.key, other.key)
One case I've been thinking about is that of maintaining a list of Decimal objects that are sorted by absolute value. For this, having to create a list of (abs(x), x) pairs just seems clumsy compared to using a key argument to bisect.
Perhaps this is a contrived use case, but it doesn't seem totally implausible.
I agree you can get around this with defining __cmp__, however same goes to "sort" and it was added anyway.
My take on it is that sometimes I need to find something in a big list of objects, and I don't like to do DSU and not add __cmp__ to the objects (since some of the lists might be sorted by different attributes
I'd prefer if we do implement "key" and add a warning in the docs it might slow you down. Which what will happen in the case of __cmp__ anyway.
I don't see why the "key" function should be called all the time on inserted item, it's very easy to cache this value
def bisect(a, x, lo=0, hi=None, key=lambda x: x):
assert low >= 0, "low must be non-negative"
hi = hi or len(a)
x_key = key(x)
while lo < hi:
mid = (lo+hi)//2
if x_key < key(a[mid]): hi = mid
else: lo = mid+1
return lo
(I'd also wish for "identity" built in, however this is another subject :)
Just a reminder that __cmp is gone in 3.0. I presume bisect, like sort, only requires __lt and perhaps __eq__, though I can find no doc of either.
I've been bugged by the lack of key= argument for bisect for some time now, and today I got to read this and the previous issues about the matter. I still fail to understand the reasons for the rejections. It might encourage bad design in which expensive key functions will get called more than once for the same element. However:
I was writing a module that keeps a sorted list of items in which items can be added, removed, or changed (they might get reordered). I used bisect to avoid useless slow linear searches, which would be one obstacle for the app to scale well. However, the sort key can be changed at any time, so implementing __lt__ wasn't an option. Keeping a second list with the keys is both memory and computationally expensive as inserting into a list is slow. It's not a shiny example of a bisect use-case as the only thing I've used my module so far is to display the files in a directory, which means the lists are small and the changes are seldom, so any implementation is good enough. But bisect with key= would have been best if it was available. What's worse, I have a buggy unreadable ad hoc implementation of bisect in my code right now. ;)
I see the following as reasons why key= would provide benefit:
I've implemented key= and quickly hacked a benchmark to show what performance hit it might have, and to put things into perspective. The results:
Cheap key function, integers and abs(), 1000000 items, 5000 bisects or insorts: a) Bisect with a key: 0.02205014s b) Bisect with a second list: 0.01328707s c) Insort with a key: 5.83688211s d) Bisect with a second list, and two inserts: 16.17302299s
Cheap key function, \~4000 char bytestrings and len(), 100000 items, 5000 bisects or insorts: a) Bisect with a key: 0.01829195s b) Bisect with a second list: 0.00945401s c) Insort with a key: 0.25511408s d) Bisect with a second list, and two inserts: 0.49303603s
Expensive key function, \~4000 char bytestrings and str.lower(), 100000 (500 MB) items, 5000 bisects or insorts: a) Bisect with a key: 1.26837015s b) Bisect with a second list: 0.08390594s c) Insort with a key: 1.50406289s d) Bisect with a second list, and two inserts: 0.57737398s
Expensive key function, \~4000 char bytestrings and str.lower(), 500000 (2.5 GB) items, 5000 bisects or insorts: a) Bisect with a key: 1.46136308s b) Bisect with a second list: 0.08768606s c) Insort with a key: 3.05218720s d) Bisect with a second list, and two inserts: 6.43227386s
Expensive key function, \~4000 char bytestrings and str.strip(), 100000 (500 MB) items, 5000 bisects or insorts: a) Bisect with a key: 0.03311396s b) Bisect with a second list: 0.01374602s c) Insort with a key: 0.27423000s d) Bisect with a second list, and two inserts: 0.49585080s
Expensive key function, \~4000 char bytestrings and str.strip(), 500000 (2.5 GB) items, 5000 bisects or insorts: a) Bisect with a key: 0.04530501 b) Bisect with a second list: 0.01912594 c) Insort with a key: 1.62209797 d) Bisect with a second list, and two inserts: 5.91734695
Also, I tried to bench linear searches, but as they had to run in Python code they aren't representative of anything. In the integer test they went for about 250 seconds without recalculating the key, and for about 500 with. Also, replacing them with .index() gave about 60 seconds if I ensured there's high probability that the number is in the list, and for 500 if I didn't.
In short, key= for bisect would be convenient and neat, really useful in rare cases, leading to more readable code in the most common cases, and I'm unconvinced of the perceived harm that it would cause.
I attach a patch implementing it for trunk, py3k, and the benchmark script I used to test it (on trunk).
This was closed over a year ago, but since mark.dickinson was asking for convincing use-cases: I'm breaking up a file into line-delimited chunks. These chunks are non-overlapping, contiguous, and tend to be fairly large, so I'm just recording the start line of each chunk in a 2-ple:
mapping = [
(10, 'first chunk'),
(50, 'second chunk'),
(60, 'third chunk')
]
Lines 10-49 are in the first chunk, lines 50-59 are in the second, lines 60+ are in the third. So:
def CategorizeLine(line, mapping):
loc = bisect.bisect([m[0] for m in mapping], line)
if loc == 0:
return None # before first chunk
return mapping[loc-1][1]
It Would Be Nice if I could write the second line as:
loc = bisect.bisect(mapping, line, key=lambda m:m[0])
The bisect documentation suggests pre-computing the key list, but it seems messy and error-prone to keep a redundant data structure in sync with its source. I could also rewrite my "mapping" data structure to be two parallel lists instead of one list of 2-ples, but this data structure is more readable and extensible and handles empty lists more naturally.
I'll take another look at this one.
For what it's worth, after I posted my comment, I realized I could use tuple comparison semantics:
loc = bisect.bisect(mapping, (line,))
since my key happens to be at index 0. "key=" would still be nice.
What would you guys think about adding a class that uses bisect internally so that the keyfunc() never gets called more than once per key? This could be added to the docs as a cut-and-pastable recipe or it could be added to the module directly:
sd = SortedCollection(iterable, key=keyfunc)
s[i] --> getitem at position i
len(s) --> length
sd.insert_left(item) --> None
sd.insert_right(item) --> None
sd.find_left(key) --> item
sd.find_right(key) --> item
sd.keyfunc --> the key function
list(sd) --> sorted list of items
list(reversed(sd)) --> reverse sorted list of items
s.clear()
The implementation would follow this pattern:
def insert_left(self, item):
key = self._keyfunc(item)
i = bisect_left(self._keys, key)
self._keys.insert(i, key)
self._items.insert(i, item)
def find_left(self, key):
i = bisect_left(self._keys, key)
return self._items[i]
def __iter__(self):
return iter(self._items)
def __getitem__(self, i):
return self._items[i]
Attaching a draft of a sorted collection class.
Any comments?
Looks nice to me, however I wonder if there isn't some overlap with the requests to add a key= kwarg to heapq methods (and the discussion about adding a Heap class there).
Added a link to the a SortedCollections recipe, added example of how to do searches, and noted the O(n) performance of the insort() functions.
See r83786
This issue came up on #python IRC, and that combined with the number of times this has been duplicated makes me think that maybe the mention of the SortedCollection recipe should be a little more prominent.
Perhaps either moved up by the method list, or something like: "Note: No key argument is available because of performance issues. Please consider either the SortedCollection recipe or altering the objects comparison methods.
? It just seems like it keeps coming up, and in this case came up after the recipe reference was added.
maybe the mention of the SortedCollection recipe should be a little more prominent.
Thanks for the suggestion. Will look at moving the note higher on the page.
Fixed in r84383.
On python-ideas, thread "Optional key to bisect
's functions?" Guido wrote "Bingo. That clinches it. We need to add key=." 'That' being the fact that values that have keys may not be comparable themselves (in py3), so that comparing (key,value) pairs may raise TypeError. This follows a reply by him yesterday saying that he has wanted this feature occasionally. I am re-opening this rather than a new issue because there is already a patch with tests.
I'm +1 on adding this feature. See discussion in python-ideas.
PS. It should also be added to heapq.
Already on tracker as bpo-13742 Add a key parameter (like sorted) to heapq.merge
I've worked around this in 2.6/2.7 like this:
class Arr:
def __getitem__(self, i):
return foo(i) # your key function
def __len__(self):
return 10000000 # your max index value
bisect.bisect(Arr(), value, ...)
Use case: a custom immutable array with a large number of items and indirect key field access. For example ctypes.array, memoryview or ctypes.pointer or any other custom container.
I'm not sure how anyone can consider a precached key array as a right ans scalable answer. It is just a ridiculuos idea. Precashing key array is a O(N) operation. While bisect is O(log(N)).
@Raymond There is a statement that "adding 'key()' would encourage bad design and steer people after from better solutions." Well, right now, the design that is being encouraged results in O(N) code. Because lazy developers are just 'pre-cacaching' (copying) the keys!
There is a statement that key() have to be called multiple times per item. What? Why?
There is a statement that one can always add a _cmp_ function to an object. This is ridiculuous. The object could be immutable. There should be no need to modify the object/array when all that you need to do is to bisect it.
In the meantime here is a workaround
https://gist.github.com/ericremoreynolds/2d80300dabc70eebc790
I've just encountered another case where the lack of a key on bisect has led to more complicated and error-prone code than necessary.
Quick summary: we've got a list containing an ordered collection of non-overlapping open intervals on the real line. (In case anyone wants to know, the intervals represent the depths of chunks of interest in a rock core sample.) There's then code for splitting an interval into two pieces at a given point, and for merging two adjacent intervals. Splitting involves (i) finding the relevant interval using a bisect search based on the left endpoint of each interval, then (ii) replacing that interval with two new intervals in the list.
The fact that the list is being modified after every bisect search makes it messy to cache the left endpoints, since that cache has to be updated along with the list at every stage. The operations are also relatively rare, which makes it feel inefficient to be computing *all* the left endpoints of the intervals in the first place.
Adding comparisons to our interval class is doable, but invasive and unnatural, since the actual class carries other pieces of data and there are many possible meanings for <
with respect to that class: it doesn't make sense to hard-code the comparison with respect to depths in that class's __lt__
method.
So I think there's a strong case for a key argument in some future version of Python.
[Of course, for our app we're on Python 2.7, so this issue won't help us directly. We're probably going to go with either reimplementing bisect or using a proxy array like the one suggested by Eric Reynolds.]
I'll add a key= variant for Python 3.6.
I'm also looking for bisect with a key argument (numpy array of records, want to search on one element). However, I don't see bisect in the what's new: https://docs.python.org/3.6/whatsnew/3.6.html ? Any luck with the implementation?
obviously didn't make it in 3.6 but this still seems desirable. I just saw someone at work propose a trivial port of golang's sort.Search - https://golang.org/pkg/sort/#Search - in Python which caused me to hunt for an issue on bisect key= support.
Hi everybody, I opened PR 11781 to add a key argument to functions in the bisect module. I agree with @dmtr's points that this addition is not a bad design.
As far as I can tell, the key function is at called at most once per item as this example where an assertion would break shows:
import bisect
from collections import defaultdict
class Test:
def __init__(self, value):
self.value = value
cache = defaultdict(int)
def key(e):
cache[e] += 1
assert cache[e] <= 1
return e.value
l = [Test(i) for i in range(10000)]
bisect.bisect(l, Test(25), key=key)
➜ cpython git:(add-key-argument-to-bisect) ./python.exe
Python 3.8.0a1+ (heads/add-key-argument-to-bisect:b7aaa1adad, Feb 7 2019, 17:33:24)
[Clang 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import bisect
>>> from collections import defaultdict
>>>
>>>
>>> class Test:
... def __init__(self, value):
... self.value = value
...
>>>
>>> cache = defaultdict(int)
>>>
>>> def key(e):
... cache[e] += 1
... assert cache[e] <= 1
... return e.value
...
>>>
>>> l = [Test(i) for i in range(10000)]
>>>
>>> bisect.bisect(l, Test(25), key=key)
26
This argument can be used where the objects are immutable and I have not been able to see changes in bisect speed using Victor Stinner perf module:
(cpython-venv) ➜ cpython git:(add-key-argument-to-bisect) ✗ make distclean && ./configure --with-pydebug --with-openssl=$(brew --prefix openssl) && make -sj && python -m perf timeit --rigorous -s "from bisect import bisect" "bisect(range(1_000_000_000_000_000), 25)" -o $(git rev-parse --short HEAD).json (cpython-venv) ➜ cpython git:(add-key-argument-to-bisect) ✗ git checkout cd90f6a369 (cpython-venv) ➜ cpython git:(cd90f6a369) ✗ make distclean && ./configure --with-pydebug --with-openssl=$(brew --prefix openssl) && make -sj && python -m perf timeit --rigorous -s "from bisect import bisect" "bisect(range(1_000_000_000_000_000), 25)" -o $(git rev-parse --short HEAD).json (cpython-venv) ➜ cpython git:(cd90f6a369) ✗ python -m perf compare_to cd90f6a369.json b7aaa1adad.json Mean +- std dev: [cd90f6a369] 36.2 us +- 1.0 us -> [b7aaa1adad] 35.7 us +- 0.5 us: 1.01x faster (-1%)
(cd90f6a369 was somtime faster than b7aaa1adad, sometime slower but they always were less than one std dev from one another)
As I wrote in the discussion of the PR, I suspect the branch predictor to predict reliably the branching in the hot path (though I don't know much about that and would love some input).
For the record, here is the performance when a key function is given:
(cpython-venv) ➜ cpython git:(add-key-argument-to-bisect) ✗ python -m perf timeit --rigorous -s "from bisect import bisect" "bisect(range(1_000_000_000_000_000), 25, key=lambda e: e)"
......................................... Mean +- std dev: 59.3 us +- 1.0 us
It seems to me that adding the key parameter is the best solution possible.
Hi, there has been renewed interest from @alexchamberlain and @f6v on GitHub for this feature. Would it be possible to get the patch reviewed so we can add it in 3.8?
I'll look at it this weekend.
Can anyone add "reverse" support? Key and reverse support are both functional requirement.
I think it could be done with key=lambda item: -item
if the key argument is added.
The problem with key=lambda item: -item
is that item cannot always be easily negated. For example, tuples are often used as keys.
Can you not use functools.cmp_to_key() for this?
Here's an example:
>>> import bisect, functools
>>> l = [('f', 5), ('d', 3), ('c', 2), ('b', 1), ('a', 0)]
>>> def cmp(a, b):
... if a > b: return -1
... if a < b: return 1
... return 0
...
>>> bisect.bisect(l, ('e', 4), key=functools.cmp_to_key(cmp))
1
>>> l
[('f', 5), ('d', 3), ('c', 2), ('b', 1), ('a', 0)]
>>> bisect.insort(l, ('e', 4), key=functools.cmp_to_key(cmp))
>>> l
[('f', 5), ('e', 4), ('d', 3), ('c', 2), ('b', 1), ('a', 0)]
I opened a relevant PR, https://github.com/python/cpython/pull/11781.
I believe a key parameter is inferior to a comparison callback. The former is a specific case of the latter, and in my use case would force me to create another class to serve as a comparator for my objects, at which point I might as well wrap them and add __lt__.
Furthermore, I believe this is more in-line with similar standard functions in other languages such as C++ (std::sort), Java (PriorityQueue) or Rust (slice.sort_by).
I opened a relevant PR, https://github.com/python/cpython/pull/11781.
Did you mean https://github.com/python/cpython/pull/13970 ?
I did, thanks!
New changeset 871934d4cf00687b3d1411c6e344af311646c1ae by Raymond Hettinger in branch 'master': bpo-4356: Add key function support to the bisect module (GH-20556) https://github.com/python/cpython/commit/871934d4cf00687b3d1411c6e344af311646c1ae
New changeset 1aaa85949717e4ab2ed700e58762f0a3ce049a37 by Pablo Galindo Salgado in branch 'main': bpo-4356: Mention the new key arguments for the bisect module APIs in the 3.10 What's new (GH-28339) https://github.com/python/cpython/commit/1aaa85949717e4ab2ed700e58762f0a3ce049a37
New changeset dda5ff2d095c795f00afaa64505069a2409f6099 by Miss Islington (bot) in branch '3.10': bpo-4356: Mention the new key arguments for the bisect module APIs in the 3.10 What's new (GH-28339) (GH-28340) https://github.com/python/cpython/commit/dda5ff2d095c795f00afaa64505069a2409f6099
New changeset 7128864885340e0d75ddfe32887257e245d9c1f7 by Pablo Galindo (Miss Islington (bot)) in branch '3.10': bpo-4356: Mention the new key arguments for the bisect module APIs in the 3.10 What's new (GH-28339) (GH-28340) https://github.com/python/cpython/commit/7128864885340e0d75ddfe32887257e245d9c1f7
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at =
created_at =
labels = ['type-feature', 'library', '3.10']
title = 'Add "key" argument to "bisect" module functions'
updated_at =
user = 'https://github.com/tebeka'
```
bugs.python.org fields:
```python
activity =
actor = 'pablogsal'
assignee = 'rhettinger'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Library (Lib)']
creation =
creator = 'tebeka'
dependencies = []
files = ['15248', '15249', '15250', '16944']
hgrepos = []
issue_num = 4356
keywords = ['patch']
message_count = 43.0
messages = ['76060', '76073', '76084', '76092', '76098', '76099', '76203', '76230', '94836', '103143', '103145', '103155', '103158', '103180', '103294', '113222', '115176', '115189', '115288', '152967', '152969', '152983', '208020', '237408', '242267', '246371', '246384', '261007', '305518', '335841', '337737', '337741', '343529', '343544', '343548', '343549', '345220', '345227', '345228', '379086', '401796', '401798', '403173']
nosy_count = 22.0
nosy_names = ['rhettinger', 'terry.reedy', 'gregory.p.smith', 'jafo', 'tebeka', 'mark.dickinson', 'alex', 'milko.krachounov', 'dmtr', 'bls', 'martin.panter', 'Dima.Tisnek', 'yselivanov', 'NeilGirdhar', 'josh.r', 'ericreynolds', 'wumpus', 'pablogsal', 'miss-islington', 'remi.lapeyre', 'gpery', 'flyingosprey']
pr_nums = ['11781', '11781', '20556', '28339', '28340']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue4356'
versions = ['Python 3.10']
```