python / cpython

The Python programming language
https://www.python.org
Other
63.36k stars 30.34k forks source link

set.add can return boolean indicate newly added item #58528

Closed da8fd79a-d72b-43c9-b0d1-e28ad73aa32f closed 12 years ago

da8fd79a-d72b-43c9-b0d1-e28ad73aa32f commented 12 years ago
BPO 14320
Nosy @gvanrossum, @rhettinger, @giampaolo, @merwok, @anacrolix, @akheron
Files
  • set-add-return-bool.patch
  • bench_set_add.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['interpreter-core', 'performance'] title = 'set.add can return boolean indicate newly added item' updated_at = user = 'https://github.com/anacrolix' ``` bugs.python.org fields: ```python activity = actor = 'giampaolo.rodola' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Interpreter Core'] creation = creator = 'anacrolix' dependencies = [] files = ['24862', '24863'] hgrepos = [] issue_num = 14320 keywords = ['patch'] message_count = 4.0 messages = ['155909', '155929', '155948', '156005'] nosy_count = 6.0 nosy_names = ['gvanrossum', 'rhettinger', 'giampaolo.rodola', 'eric.araujo', 'anacrolix', 'petri.lehtinen'] pr_nums = [] priority = 'low' resolution = 'rejected' stage = None status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue14320' versions = ['Python 3.3'] ```

    da8fd79a-d72b-43c9-b0d1-e28ad73aa32f commented 12 years ago

    set.add can return True to indicate a newly added item to the set, or False if the item was already present.

    The C function PySet_Add returns -1 on error, and 0 on success currently. This is extended to return 1 if the item is newly added, and 0 if it was already present. Precedents exist for PySet_Contains and PySet_Discard with the same semantics. There are only 3 calls that need to be patched in the entire core, these are included in the patch.

    The Python function set.add currently returns None. This is extended to return True if if the item is new to the set, and False otherwise. No object overhead is introduced as set.add is already currently executing Py_RETURN_NONE.

    Benchmarks indicate that set.add if the item isn't already present (returning True in the patched instance) reduces time by 5-9%. set.add if the item already exists increases time by 1-3%. I'm happy to put these down to effects of touching True and False instead of None on return from set.add.

    Patch and primitive performance test attached.

    rhettinger commented 12 years ago

    The part of the patch for PySet_Add() is a reasonable improvement to the C API if it doesn't conflict with Martin's stable ABI effort.

    The question of whether to change the Python API requires much more thought and I'll do some research and evaluate it more thoroughly over the next few weeks. Here are some of the considerations:

        def dedup_before(iterable):
            '''Order preserving elimination of duplicates'''
            seen = set()
            for i in iterable:
                if i not in seen:
                    seen.add(i)
                    yield i
    
        def dedup_after(iterable):
            '''Order preserving elimination of duplicates'''
            seen = Set()
            for i in iterable:
                if seen.add(i):
                    yield i

    As you can see, there is more to API design than just spotting an opportunity to fold two steps into one.

    rhettinger commented 12 years ago

    Here's the results of the research so far:

    Guido gave it a -1: http://mail.python.org/pipermail/python-ideas/2012-March/014510.html

    Smalltalk does not return a boolean: http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#Set http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#HashedCollection add: newObject Add newObject to the set, if and only if the set doesn't already contain an occurrence of it. Don't fail if a duplicate is found. Answer anObject

    Java does return a boolean for Interface Set\<E>>: http://docs.oracle.com/javase/6/docs/api/ boolean add(E e) Adds the specified element to this set if it is not already present (optional operation)

    The SETL programming language does not return a boolean for the "set plus one element operation". Instead, it returns the new set. http://setl.org/setl/doc/setl-lib.html#with

    Real-world examples of code using set.add in a way that benefits from the boolean result do not look promising. They match the pattern in the minimal example shown in the previous post, "if e in previously_seen.add(e): ..." This is problematic because it can easily be read as "take this branch if e has been previously seen".

    I consulted with the PyPy developers and found that PyPy is already eliminating double lookups in common cases (it recognizes that the two lookups refer to the same entry and constant folds them). It cannot do this in the general case because of a potential semantic change (i.e. __eq__ can change the set during the initial lookup).

    Looking at Jython, it appears that both sets and dicts are based on Java's concurrent mapping object which doesn't support a boolean result directly but allows the JIT to optimize away the double lookup depending on the JIT and on whether the keys are of all the same type.

    One other note, in general the Python language makes no guarantees about atomicity (different implementations are free to make different implementation choices). For example, people relying on dict.setdefault() to be atomic were making an incorrect assumption (until recently). Our standing recommendation is to use locks unless your willing to rely on an implementation detail. In the case of set.add(), the race condition is harmless since set.add() is idempotent.

    I've shown sample code to some other developers and they had mixed guesses about the polarity of s.add(e), whether it would return True or False if the e was already in the set. A mixed result is not good and implies that the proposal is error prone.

    Also, when shown the dedup() code example, the devs didn't feel that the additional conciseness was worth it (one said, "it just looks funny and causes me to do a double-take, but the original code could be read effortlessly").

    My timings on the dedup() code showed a \<2% speedup on an iterable of strings with 80% duplicates and no improvement on an iterable of strings with 5% duplicates. The improvement was mostly erased if a "seenadd = seen.add" bound method was used inside the loop. This indicates that the double lookup is cheap compared to the cost of the dotted lookup in "seen.add(e)". That being said, objects with a custom \_hash would benefit from calling __hash only once.

    Given that the results above were uniformly negative (except for the Java library), I'll skip trying this out on my students and am rejecting the suggested change to the pure Python API. We appreciate your suggestion but it isn't appropriate for inclusion in the language.

    Some of those concerns may also apply to the suggested change to the C API

    da8fd79a-d72b-43c9-b0d1-e28ad73aa32f commented 12 years ago

    Is there still some value to at least exposing this in the C API, per the precedents I mentioned?

    The patch also contains some adjustment to the set_add_entry/set_add_key abstraction dance, and some future proofing of PySet_Add return values that have merit on their own.