Closed: MarkRotchell closed this 3 months ago
FWIW, this is not a bug. It was an intentional design decision that gives good performance for typical use cases.
Marking this as a feature request to add a variant of `mode()` that supports unhashable inputs. For data with a total ordering, this could be implemented with an O(n log n) algorithm along the lines of `sort data | uniq -c | head -1`. For more challenging data, such as that in your example, an O(n**2) algorithm would be needed. We would need pretty strong use cases to warrant adding something like that:
```python
def mode_general(data):
    'Quadratic algorithm. Depends only on equality tests.'
    return max(data, key=data.count)

data = [{1,2}, {1,2}, {3}]
print(mode_general(data))
```
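For comparison, the O(n log n) sorting approach mentioned above could be sketched like this (a rough sketch, not a proposed implementation; the name `mode_sorted` is mine, and it assumes the data has a total ordering):

```python
from itertools import groupby

def mode_sorted(data):
    'O(n log n) algorithm.  Requires that the data have a total ordering.'
    # Sorting brings equal values together; groupby then measures each run,
    # analogous to the shell pipeline `sort data | uniq -c | head -1`.
    runs = ((len(list(group)), value) for value, group in groupby(sorted(data)))
    return max(runs)[1]

print(mode_sorted([3, 1, 2, 2, 3, 2]))   # 2
```

Note that on ties this picks the largest value, whereas `mode()` returns the first mode encountered in the data.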
For the existing `mode()`, we could add notes in the documentation about input restrictions, or suggest work-arounds. In your case, converting to a `frozenset` would be a reasonable option:
```python
from statistics import mode

data = [{1,2}, {1,2}, {3}]
print(mode(map(frozenset, data)))
```
Really appreciate the feedback, thanks, it makes a lot of sense.
I think updating the docs is worthwhile, but agree an O(n**2) algorithm should be avoided.

It would be nice to have an O(n log n) algorithm as a fallback, though I'm struggling to figure out how we could detect whether the data has a total order. Obviously, we could check for data items that are sets, but they may be nested, e.g.,
```python
>>> sorted([({1,2},), ({3},), ({1,2},)])
[({1, 2},), ({3},), ({1, 2},)]
```
And user-defined objects would be more difficult. Perhaps we could use the O(n log n) algorithm for all non-hashable cases (so that it works "out of the box" on, say, lists of lists) and specify in the docs that the behavior is not defined for data without a total order?
> It would be nice to do an O(n log n) algorithm as a fall back
A fallback to sorting is perilous because we have no way to determine whether the input data has a total ordering. In particular, the `set` inputs in your example wouldn't work. Here's a quick demonstration of the hazard:
```python
from itertools import chain, combinations, groupby
from random import shuffle

def powerset(iterable):
    "powerset([1,2,3]) → () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

reps = 3
data = list(map(set, powerset(range(4))))
data *= reps
shuffle(data)
data.sort()
for k, g in groupby(data):
    g = list(g)
    if len(g) != reps:
        print('Failed at:', g)
        print('Input:', data)
```
Also, we generally try to avoid silent degradations in performance. In this case, the O(n log n) sorting and O(n**2) counting approaches both make the running time worse, and they both require that the entire dataset be loaded into memory.
Draft edit:

```rst
Only hashable inputs are supported.  To handle type :class:`set`,
consider casting to :class:`frozenset`.  To handle type :class:`list`,
consider casting to :class:`tuple`.  For mixed or nested inputs, consider
using a slower quadratic algorithm that only depends on equality tests:
``max(data, key=data.count)``.
```
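The three workarounds the draft names could be exercised as follows (the data values here are illustrative, not from the proposal):

```python
from statistics import mode

# Sets: cast to frozenset.
sets = [{1, 2}, {1, 2}, {3}]
print(mode(map(frozenset, sets)))    # frozenset({1, 2})

# Lists: cast to tuple.
lists = [[1, 2], [1, 2], [3]]
print(mode(map(tuple, lists)))       # (1, 2)

# Mixed or nested inputs: quadratic fallback using only equality tests.
mixed = [[{1}], [{1}], [{2}]]
print(max(mixed, key=mixed.count))   # [{1}]
```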
Could/should we have a go at pickling unhashable data before giving up?
```python
import pickle
from collections import Counter
from statistics import StatisticsError

def mode(data):
    try:
        pairs = Counter(iter(data)).most_common(1)
    except TypeError:
        try:
            return pickle.loads(mode(map(pickle.dumps, data)))
        except pickle.PickleError:
            raise StatisticsError('data must be all hashable or all picklable') from None
    try:
        return pairs[0][0]
    except IndexError:
        raise StatisticsError('no mode for empty data') from None
```
```python
>>> my_sets = [{3}, {1,2}, {2,1}, {5,6}]
>>> mode(my_sets)
{1, 2}
```
Obviously that would take up memory, but we'd otherwise be asking the user to create a new hashable instance of each object, which would take memory too.
Revision that takes into account that `data` could be an iterator, so we can't use it twice:

```python
import itertools
import pickle
from collections import Counter
from statistics import StatisticsError

def mode(data):
    hashable_data, pickleable_data = itertools.tee(data, 2)
    try:
        pairs = Counter(hashable_data).most_common(1)
    except TypeError:
        try:
            hash_map = {}
            counter = Counter()
            for value in pickleable_data:
                key = hash(pickle.dumps(value))
                # deal with hash collisions (unlikely, but possible)
                while key in hash_map:
                    if hash_map[key] == value:
                        break
                    else:
                        key += 1
                else:
                    hash_map[key] = value
                counter[key] += 1
            # no need to check for IndexError: empty data wouldn't have
            # raised TypeError in the first place
            return hash_map[counter.most_common(1)[0][0]]
        except pickle.PickleError:
            raise StatisticsError('data must be all hashable or all picklable') from None
    try:
        return pairs[0][0]
    except IndexError:
        raise StatisticsError('no mode for empty data') from None
```
> Could/should we have a go at pickling unhashable data before giving up?
I appreciate your effort but will decline. What we already have meets most needs most of the time and it is not worth walking out on thin ice with a pickling hack.
FWIW there really isn't anything special about mode().
A user also encounters unhashable error messages for dicts, sets, Counter, and many other tools. This is a typical restriction for the language and its standard library. It doesn't seem to be a problem in practice, and it seems to encourage structuring data so that it can be processed efficiently.
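For instance, a small illustration of that same restriction elsewhere in the library (the examples are mine):

```python
from collections import Counter

# Each of these hits the same hashability requirement as mode():
attempts = [
    lambda: {{1, 2}},            # a set cannot contain a set
    lambda: {{1, 2}: 'x'},       # a dict cannot use a set as a key
    lambda: Counter([{1, 2}]),   # Counter hashes its elements, too
]
for attempt in attempts:
    try:
        attempt()
    except TypeError as exc:
        print(exc)               # unhashable type: 'set'
```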
Thanks for this, Raymond. I've watched many of your talks on YouTube, and it's been lovely interacting with you.
Bug report
Bug description:

- `statistics.mode` uses `collections.Counter`, so it only works for data that are hashable; for unhashable data it raises `TypeError: unhashable type: 'set'`.
- Same issue with `multimode`.
- This limitation is not mentioned in the docs or the docstring.
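A minimal reproduction consistent with the description above (the data values are mine):

```python
import statistics

data = [{1, 2}, {1, 2}, {3}]

try:
    statistics.mode(data)
except TypeError as exc:
    print(exc)   # unhashable type: 'set'

try:
    statistics.multimode(data)
except TypeError as exc:
    print(exc)   # unhashable type: 'set'
```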
Potential solutions:

- raise a `StatisticsError` if the user provides unhashable data

CPython versions tested on: CPython main branch

Operating systems tested on: Windows
Linked PRs