python / cpython

The Python programming language
https://www.python.org

`statistics.mode` fails for unhashable data #121977

Closed MarkRotchell closed 3 months ago

MarkRotchell commented 3 months ago

Bug report

Bug description:

statistics.mode uses collections.Counter, so it only works for data that are hashable

from statistics import mode
mode([{1,2},{1,2},{3}])

raises TypeError: unhashable type: 'set'

Same issue with multimode.

This limitation is not mentioned in the docs or the docstring.

Potential solutions:

  1. Add the fact that the function doesn't work on unhashable types to the documentation, and raise a StatisticsError if the user provides unhashable data
  2. Come up with a separate algorithm for handling unhashable data. Would be interested to hear ideas on the appropriate algorithm.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows


rhettinger commented 3 months ago

FWIW, this is not a bug. It was an intentional design decision that gives good performance for typical use cases.

Marking this as a feature request to add a variant of mode() that supports unhashable inputs. For data with a total ordering, this could be implemented with an O(n log n) algorithm like sort data | uniq -c | head -1. For more challenging data, such as that in your example, an O(n**2) algorithm would be needed. We would need pretty strong use cases to warrant adding something like that:

def mode_general(data):
    'Quadratic algorithm. Depends only on equality tests.'
    return max(data, key=data.count)

data = [{1,2}, {1,2}, {3}]
print(mode_general(data))
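The O(n log n) sort-based variant mentioned above could be sketched like this (my sketch, assuming the data has a total ordering; note that ties are broken by value, not by order of first appearance):

```python
from itertools import groupby

def mode_sorted(data):
    'O(n log n) algorithm. Requires data with a total ordering.'
    # sort, group equal runs, pick the longest run: sort data | uniq -c | head -1
    groups = ((len(list(group)), value) for value, group in groupby(sorted(data)))
    return max(groups)[1]

print(mode_sorted([3, 1, 2, 2, 3, 2]))  # 2
```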

For the existing mode(), we could add notes in the documentation about input restrictions. Or we can also suggest work-arounds. In your case, converting to a frozenset would be a reasonable option:

data = [{1,2}, {1,2}, {3}]
print(mode(map(frozenset, data)))
MarkRotchell commented 3 months ago

Really appreciate the feedback, thanks, it makes a lot of sense.

I think updating the docs is worthwhile, but agree an O(n**2) algorithm should be avoided.

It would be nice to have an O(n log n) algorithm as a fallback, though I'm struggling to figure out how we could detect whether the data has a total order. Obviously, we could check for data items that are sets, but they may be nested, e.g.,

print(sorted([({1,2},),({3},),({1,2},)]))
>>>[({1, 2},), ({3},), ({1, 2},)]

And user-defined objects would be more difficult. Perhaps we could use the O(n log n) algorithm for all non-hashable cases (so that it works "out of the box" on, say, lists of lists) and specify in the docs that the behavior is not defined for data without a total order?
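For instance, lists of lists do sort out of the box (lists compare lexicographically), so a sort-based fallback would work there (a quick illustration):

```python
data = [[1, 2], [3], [1, 2]]
# lexicographic comparison groups the equal lists together
print(sorted(data))  # [[1, 2], [1, 2], [3]]
```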

rhettinger commented 3 months ago

It would be nice to have an O(n log n) algorithm as a fallback

A fallback to sorting is perilous because we have no way to determine whether the input data has a total ordering. In particular, the set inputs in your example wouldn't work. Here's a quick demonstration of the hazard:

from itertools import chain, combinations, groupby
from random import shuffle

def powerset(iterable):
    "powerset([1,2,3]) → () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

reps = 3
data = list(map(set, powerset(range(4))))
data *= reps
shuffle(data)

data.sort()
for k, g in groupby(data):
    g = list(g)
    if len(g) != reps:
        print('Failed at:', g)
        print('Input:', data)

Also, we generally try to avoid silent degradations in performance. In this case, the O(n log n) sorting and O(n**2) counting approaches both make the running time worse, and both require that the entire dataset be loaded into memory.

rhettinger commented 3 months ago

Draft edit:

   Only hashable inputs are supported.  To handle type :class:`set`,
   consider casting to :class:`frozenset`.  To handle type :class:`list`,
   consider casting to :class:`tuple`.  For mixed or nested inputs, consider
   using a slower quadratic algorithm that only depends on equality tests:
   ``max(data, key=data.count)``.
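As a quick illustration of the draft note's three suggestions (a sketch, not part of the proposed docs text):

```python
from statistics import mode

# set inputs: cast to frozenset
print(mode(map(frozenset, [{1, 2}, {1, 2}, {3}])))  # frozenset({1, 2})

# list inputs: cast to tuple
print(mode(map(tuple, [[1, 2], [1, 2], [3]])))      # (1, 2)

# mixed or nested inputs: quadratic fallback using only equality tests
data = [({1, 2},), ({3},), ({1, 2},)]
print(max(data, key=data.count))                    # ({1, 2},)
```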
MarkRotchell commented 3 months ago

Could/should we have a go at pickling unhashable data before giving up?

import pickle
from collections import Counter
from statistics import StatisticsError

def mode(data):
    try:
        pairs = Counter(iter(data)).most_common(1)
    except TypeError:
        # fall back to counting pickled representations of the values
        try:
            return pickle.loads(mode(map(pickle.dumps, data)))
        except pickle.PickleError:
            raise StatisticsError('data must be all hashable or all picklable') from None
    try:
        return pairs[0][0]
    except IndexError:
        raise StatisticsError('no mode for empty data') from None

my_sets = [{3}, {1,2}, {2,1}, {5,6}]
mode(my_sets)
>>>{1, 2}

Obviously that would take up memory, but if we'd otherwise be asking the user to create new hashable instances of the objects, so would that.

MarkRotchell commented 3 months ago

Revision that takes account of:

  1. the pickle for some objects could be very large in memory, so lets hash it to something smaller before we store it
  2. the necessity to deal with hash collisions from the above
  3. the possiblilty that data could be an iterable, so we can't use it twice
  4. we should really be testing whether two objects are equal, not just their pickles
import itertools
import pickle
from collections import Counter
from statistics import StatisticsError

def mode(data):
    hashable_data, pickleable_data = itertools.tee(data, 2)
    try:
        pairs = Counter(hashable_data).most_common(1)
    except TypeError:
        try:
            hash_map = {}
            counter = Counter()
            for value in pickleable_data:
                key = hash(pickle.dumps(value))

                # deal with hash collisions (unlikely, but possible)
                while key in hash_map:
                    if hash_map[key] == value:
                        break
                    key += 1
                else:
                    hash_map[key] = value
                counter[key] += 1

            # no need to check for IndexError: empty data wouldn't have raised TypeError in the first place
            return hash_map[counter.most_common(1)[0][0]]
        except pickle.PickleError:
            raise StatisticsError('data must be all hashable or all picklable') from None
    try:
        return pairs[0][0]
    except IndexError:
        raise StatisticsError('no mode for empty data') from None
rhettinger commented 3 months ago

Could/should we have a go at pickling unhashable data before giving up?

I appreciate your effort but will decline. What we already have meets most needs most of the time and it is not worth walking out on thin ice with a pickling hack.

FWIW there really isn't anything special about mode(). A user also encounters unhashable error messages for dicts, sets, Counter, and many other tools. This is a typical restriction for the language and its standard library. It doesn't seem to be a problem in practice, and it seems to encourage structuring data so that it can be processed efficiently.

MarkRotchell commented 3 months ago

Thanks for this, Raymond. I've watched many of your talks on YouTube, and it's been lovely interacting with you.