quantifiedcode/python-anti-patterns

An open collection of Python anti-patterns and worst practices.
https://quantifiedcode.github.io/python-anti-patterns

"Using key in list to check if key is contained in list" -- Nitpick #157

Closed: codethief closed this issue 3 years ago

codethief commented 3 years ago

https://docs.quantifiedcode.com/python-anti-patterns/performance/using_key_in_list_to_check_if_key_is_contained_in_a_list.html currently claims that the following code,

s = set([1, 2, 3, 4])

if 3 in s:
    print("The number 3 is in the list.")
else:
    print("The number 3 is NOT in the list.")

is more efficient than one where s is a list. Strictly speaking, this is wrong. When creating the set, the set() constructor first needs to compare all elements with one another to eliminate duplicates, leading to a time complexity of O(n²). Meanwhile, a single x in mylist check has complexity O(n).

Some numbers:

$ python3 -m timeit "l = list(range(1000000)); print(500 in l)"
20 loops, best of 5: 15 msec per loop
$ python3 -m timeit "l = set(range(1000000)); print(500 in l)"
10 loops, best of 5: 35.4 msec per loop

(I had to use a large list/set here to get a reliable time measurement.)

Sure, an x in myset check alone has average complexity O(1), so I definitely understand the point the text is trying to make. I guess the above set creation is simply undermining it. :)
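
For what it's worth, timing only the membership test (moving the construction into timeit's setup with -s) should show the intended effect. A minimal sketch (no timings quoted, since they depend on the machine); probing for the last element forces the list to scan its full length, while the set does a single hash lookup on average:

$ python3 -m timeit -s "l = list(range(1000000))" "999999 in l"
$ python3 -m timeit -s "s = set(range(1000000))" "999999 in s"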

On a side note, the set creation in the code example can do without the list literal and be simplified to:

s = {1, 2, 3, 4}
# or, like I wrote in my time measurement,
s = set(range(1000000))

which one might also call more idiomatic than set([1, 2, 3, 4]). (Obviously, this doesn't change the complexity: apart from no longer having to allocate an intermediate list, the performance characteristics stay the same, because duplicates still need to be filtered out while building the set.)
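
One way to see the difference between the two spellings (a minimal sketch; the exact opcodes depend on the CPython version) is to disassemble both:

import dis

# The literal compiles down to a single BUILD_SET instruction...
dis.dis("s = {1, 2, 3, 4}")

# ...whereas set([1, 2, 3, 4]) first builds a throwaway list and then
# calls set() on it.
dis.dis("s = set([1, 2, 3, 4])")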

haakonvt commented 3 years ago

> When creating the set, the set() constructor first needs to compare all elements with one another to eliminate duplicates, leading to a time complexity of O(n²).

Pretty sure converting a sequence to a set has linear time complexity? Each add is O(1) on average, so n inserts amount to O(n) in total, right?
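
A quick empirical sanity check (just a sketch; absolute numbers depend on the machine and Python version): if construction is O(n), doubling n should roughly double the construction time.

from timeit import timeit

# Doubling n should roughly double the time if construction is O(n)
# (modulo allocator noise and cache effects).
for n in (100_000, 200_000, 400_000, 800_000):
    t = timeit(lambda: set(range(n)), number=10)
    print(f"n = {n:>7,}: {t:.3f} s for 10 constructions")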

codethief commented 3 years ago

You're right, looks like I had a serious brain fart. Judging by https://github.com/python/cpython/blob/main/Objects/setobject.c (in particular, the functions make_new_set(), set_update_internal(), set_add_key() and set_add_entry()), set creation is O(n): the set hashes every new item and puts it into a hash table (in the form of a simple C array). Compared to list creation (which is O(n) as well), the hash function adds an extra constant cost per element, i.e. a larger constant factor in front of the O(n), which explains the time difference in my measurements. But of course it's not O(n²), as I had claimed. Sorry about that!
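
To make that constant-factor difference visible (again only a sketch; the ratio will vary by machine and CPython version), one can time both constructions side by side; both grow linearly, but the set should be consistently slower by a roughly constant ratio:

from timeit import timeit

# Both constructions are O(n); the set is slower only by a constant
# factor, since every element is hashed and inserted into a hash table.
for n in (250_000, 500_000, 1_000_000):
    t_list = timeit(lambda: list(range(n)), number=10)
    t_set = timeit(lambda: set(range(n)), number=10)
    print(f"n = {n:>9,}: list {t_list:.3f} s, set {t_set:.3f} s, "
          f"ratio {t_set / t_list:.2f}")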