Closed codethief closed 3 years ago
> When creating the set, the `set()` constructor first needs to compare all elements with one another to eliminate duplicates, leading to a time complexity of O(n²).
Pretty sure converting a sequence to a set has linear time complexity? Each add is O(1), so N inserts result in O(n), right?
You're right, looks like I had a serious brain fart. Judging by https://github.com/python/cpython/blob/main/Objects/setobject.c (in particular, the functions `make_new_set()`, `set_update_internal()`, `set_add_key()` and `set_add_entry()`), set creation is O(n), because the set hashes every new item and puts it into a hash table (in the form of a simple C array). Hashing still adds an extra (but constant) cost per item compared to list creation (which is O(n) as well), so the constant factor in front of the linear term is somewhat larger, which explains the time difference in my measurements. But of course it's not O(n²), as I had claimed. Sorry about that!
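The hashing argument above can be checked without timing anything. The following is only a rough sketch: `Counted` and `count_operations` are hypothetical helpers made up for this illustration, and the counts reflect CPython's current set implementation, which compares two keys only when their hashes are equal.

```python
class Counted:
    """Hypothetical wrapper that counts how often it is hashed or compared."""

    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        Counted.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        Counted.eq_calls += 1
        return isinstance(other, Counted) and self.value == other.value


def count_operations(n):
    """Build a set from n distinct items; return (hash calls, equality calls)."""
    Counted.hash_calls = 0
    Counted.eq_calls = 0
    items = [Counted(i) for i in range(n)]
    set(items)  # each item is hashed once while being inserted
    return Counted.hash_calls, Counted.eq_calls


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, count_operations(n))
```

If the constructor really compared all elements pairwise, the equality count would grow quadratically with `n`. Under the assumptions above, the number of hash calls instead grows in step with `n`.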
https://docs.quantifiedcode.com/python-anti-patterns/performance/using_key_in_list_to_check_if_key_is_contained_in_a_list.html currently claims that its example code, in which `s` is a set, is more efficient than a version where `s` is a list. Strictly speaking, this is wrong. When creating the set, the `set()` constructor first needs to compare all elements with one another to eliminate duplicates, leading to a time complexity of O(n²). Meanwhile, `s in mylist` has complexity O(n).

Some numbers:
(I had to use a large list/set here to get a reliable time measurement.)
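The measurements from the original post are not preserved here. A comparison of this kind might be run roughly as follows. This is a sketch using the standard `timeit` module, and the function name, sizes, and repeat count are arbitrary choices for illustration, not the ones used in the post.

```python
import timeit


def time_build_and_lookup(n, repeats=5):
    """Time container creation and a membership test for a list and a set."""
    data = list(range(n))
    missing = -1  # not in the data, so the list lookup scans every element

    as_list = list(data)
    as_set = set(data)

    return {
        "build_list": timeit.timeit(lambda: list(data), number=repeats),
        "build_set": timeit.timeit(lambda: set(data), number=repeats),
        "lookup_list": timeit.timeit(lambda: missing in as_list, number=repeats),
        "lookup_set": timeit.timeit(lambda: missing in as_set, number=repeats),
    }


if __name__ == "__main__":
    for n in (100_000, 1_000_000):
        print(n, time_build_and_lookup(n))
```

The absolute numbers depend on the machine and Python version, so only the trend across different values of `n` is meaningful.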
Sure, `s in myset` alone has average complexity O(1), so I definitely understand the point the text is trying to make. I guess the above set creation is simply undermining it. :)

On a side note, the set creation in the code example can be simplified by dropping the list literal, which one might also call more idiomatic than `set([1, 2, 3, 4])`. (Obviously, this doesn't change the complexity, i.e. the performance characteristics stay the same apart from no longer having to allocate memory for the list, because the `set()` constructor still needs to filter out duplicates.)
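The simplified snippet itself is missing from the text above. Assuming it was a set literal (a guess, since the original code is not preserved), the two spellings compare like this:

```python
# Set built from an intermediate list, as in the form quoted above.
from_list = set([1, 2, 3, 4])

# Set literal: same contents, no intermediate list is created.
literal = {1, 2, 3, 4}

# Duplicates are dropped in a literal as well.
with_duplicates = {1, 2, 2, 3, 3, 4}

print(from_list == literal)
print(with_duplicates == literal)
```

Note that `{}` on its own creates an empty dict, so an empty set still has to be written as `set()`.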