python / cpython

The Python programming language
https://www.python.org
Other
63.2k stars 30.27k forks source link

Implement partial order on Counter #66705

Closed 06547701-06a2-4e4a-b672-16a33475101e closed 10 years ago

06547701-06a2-4e4a-b672-16a33475101e commented 10 years ago
BPO 22515
Nosy @rhettinger, @mdickinson, @pitrou, @scoder, @ericvsmith, @stevendaprano, @bitdancer, @cool-RR, @asmeurer, @ethanfurman, @serhiy-storchaka, @MojoVampire
Files
  • patch.patch
  • patch2.patch
  • issue22515.stoneleaf.patch.01
  • issue22515_partial_order_counter_v2.diff
  • issue22515_partial_order_counter_v3.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', 'library'] title = 'Implement partial order on Counter' updated_at = user = 'https://github.com/cool-RR' ``` bugs.python.org fields: ```python activity = actor = 'Aaron.Meurer' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Library (Lib)'] creation = creator = 'cool-RR' dependencies = [] files = ['36790', '36792', '36794', '36796', '36801'] hgrepos = [] issue_num = 22515 keywords = ['patch'] message_count = 66.0 messages = ['227819', '227824', '227827', '227828', '227829', '227831', '227842', '227844', '227845', '227846', '227847', '227848', '227849', '227850', '227851', '227852', '227857', '227858', '227860', '228071', '228072', '228077', '228082', '228083', '228084', '228085', '228086', '228087', '228088', '228089', '228090', '228094', '228096', '228097', '228098', '228099', '228100', '228101', '228102', '228105', '228236', '228312', '228374', '228391', '228398', '228401', '228402', '228403', '228411', '228429', '228433', '228436', '228440', '228441', '228442', '228443', '228444', '228445', '228461', '228462', '228463', '228477', '228487', '229065', '231365', '253251'] nosy_count = 13.0 nosy_names = ['rhettinger', 'mark.dickinson', 'pitrou', 'scoder', 'eric.smith', 'steven.daprano', 'r.david.murray', 'cool-RR', 'Aaron.Meurer', 'ethan.furman', 'serhiy.storchaka', 'Saimadhav.Heblikar', 'josh.r'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'patch review' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue22515' versions = ['Python 3.5'] ```

    06547701-06a2-4e4a-b672-16a33475101e commented 10 years ago

    I don't really understand the refactoring. For example why is len(self) <= len(other) needed? For which case? Note that len would also catch keys with 0 value which shouldn't have an effect.

    serhiy-storchaka commented 10 years ago

    There are some properties of set comparison:

    (a \< b) == (a \<= b and a != b) (a \<= b) == (a \< b or a == b) (a \<= b and b \<= a) == (a == b) (a \< b and b \< a) == False (a \<= b) == (a - b == set()) if (a \<= b and b \<= c) then (a \<= c) (a \<= b and a \<= c) == (a \<= (b & c)) (a \<= c and b \<= c) == ((a | b) \<= c)

    Is this true for Counter?

    06547701-06a2-4e4a-b672-16a33475101e commented 10 years ago

    There are some properties of set comparison:

    (a \< b) == (a \<= b and a != b)

    I don't think so, because counter equality is based on dict equality so it doesn't ignore zero values.

    (a \<= b) == (a \< b or a == b)

    No, ditto.

    (a \<= b and b \<= a) == (a == b)

    No, ditto.

    (a \< b and b \< a) == False

    Yes.

    (a \<= b) == (a - b == set())

    Yes.

    if (a \<= b and b \<= c) then (a \<= c)

    Yes.

    (a \<= b and a \<= c) == (a \<= (b & c))

    Yes.

    (a \<= c and b \<= c) == ((a | b) \<= c)

    Yes.

    stevendaprano commented 10 years ago

    On Sat, Oct 04, 2014 at 07:57:16AM +0000, Serhiy Storchaka wrote:

    There are some properties of set comparison:

    (a \< b) == (a \<= b and a != b) (a \<= b) == (a \< b or a == b) (a \<= b and b \<= a) == (a == b) (a \< b and b \< a) == False (a \<= b) == (a - b == set()) if (a \<= b and b \<= c) then (a \<= c) (a \<= b and a \<= c) == (a \<= (b & c)) (a \<= c and b \<= c) == ((a | b) \<= c)

    Is this true for Counter?

    I believe these are all true for multisets.

    But Counter is not *just* a multiset (a.k.a. bag) since the "counts" are not limited to non-negative integers (the natural numbers).

    py> C = Counter() py> C['a'] = -3 py> C['b'] = 'spam' py> C['c'] = 0.5 py> C Counter({'c': 0.5, 'b': 'spam', 'a': -3})

    I don't know of any standard definition for subset and superset for multisets like C above. I think that we should raise an exception in cases like that: TypeError for cases where any "count" is non-numeric, and ValueError for cases where any count is not a non-negative integer.

    (Later, if somebody comes up with a sensible meaning for sub/superset of such a Counter, we can relax that restriction.)

    Another issue: are missing elements treated as if they have count 0? E.g. what are these?

    Counter({'a': 0, 'b': 1}) \<= Counter({'a': 1, 'b': 1}) Counter({'a': 0, 'b': 1}) \<= Counter({'b': 1})

    According to these pages:

    http://singularcontiguity.wordpress.com/2008/05/07/multiset-equality-and-submultisets/ http://singularcontiguity.wordpress.com/2008/04/28/multiset-membership/

    I would argue that they should both return True (based on the idea that the "support" of a multiset is the set of those elements with non-zero count). But I don't know how authoritative that blog is. (The author even mentions that he only has a passing familiarity with some aspects of multisets.)

    Anyone familiar with Haskell? What does this bag class do?

    http://hackage.haskell.org/package/uulib-0.9.8/docs/UU-DData-IntBag.html

    This may also be helpful. Sets and bags in Z language:

    http://www.cs.ox.ac.uk/people/michael.wooldridge/teaching/soft-eng/lect08.pdf http://www.cs.ox.ac.uk/people/michael.wooldridge/teaching/soft-eng/lect14.pdf

    serhiy-storchaka commented 10 years ago

    At least this condition is false for current implementation:

    (a \< b and b \< a) == False

    Example: a = Counter({'a': -1}), b = Counter({'b': -1}).

    stevendaprano commented 10 years ago

    On Sat, Oct 04, 2014 at 08:38:17AM +0000, Ram Rachum wrote:

    Ram Rachum added the comment:

    > There are some properties of set comparison: > > (a \< b) == (a \<= b and a != b)

    I don't think so, because counter equality is based on dict equality so it doesn't ignore zero values.

    That's a very good point! That suggests that we shouldn't treat Counters as multisets (a.k.a. bags) and define subset/superset operators. They're *almost* multisets, but not quite the same, since:

    Counter({'a': 0}) != Counter({})

    but the formal definition of multiset suggests those should be equal:

    https://en.wikipedia.org/wiki/Multiset#Formal_definition

    Perhaps we should rethink this. I'm starting to wonder whether we need a PEP to define exactly how subset and superset should be defined for Counters. This is precisely the trouble with rushing to add code before we know what the code is supposed to do... :-)

    pitrou commented 10 years ago

    Just because something is discussed doesn't mean it needs a PEP. Also, see http://bugs.python.org/issue22515#msg227852

    06547701-06a2-4e4a-b672-16a33475101e commented 10 years ago

    I see that in my implementation for lt and gt I forgot to check whether the other counter has elements that this counter doesn't, which could flip found_strict_difference.

    stevendaprano commented 10 years ago

    On Sat, Oct 04, 2014 at 10:43:02AM +0000, Antoine Pitrou wrote:

    Just because something is discussed doesn't mean it needs a PEP.

    I know that in general discussions do not need a PEP, but we seem to be rushing awfully fast into writing code before we even know what the code is supposed to do. Perhaps "need a PEP" was too strong, but we surely need to do something to decide on constistent and well-defined behaviour. As you pointed out earlier, Counters aren't precisely multisets, they're more like a hybrid mapping/multiset. It's not clear what behaviour subset and superset should have for such a hybrid. By rushing into code before deciding on what the code is supposed to do, we end up with inconsistent behaviour.

    Here are a pair of Counters which compare "is subset", but not "is subset or equal" using Ram's patch:

    py> Counter(a=1, b=0) \< Counter(a=2) True py> Counter(a=1, b=0) \<= Counter(a=2) False

    On the other hand, here's a pair which compares "is subset or equal", but not ("is subset" or "equal"):

    py> Counter(a=0) \<= Counter(b=0) True py> (Counter(a=0) \< Counter(b=0)) or (Counter(a=0) == Counter(b=0)) False

    Adding elements with count=0 changes whether a set is a subset or not:

    py> Counter(a=0) \< Counter(b=0) False py> Counter(a=0) \< Counter(b=0, c=0) True py> Counter(a=0, b=0) \< Counter(b=0, c=0) False

    I'm not convinced that mixed Counter and regular dict comparisons should be supported, but Ram's patch supports any Mapping type. Is that a good idea? There's been no discussion on that question.

    Can somebody explain to me what this means? How is this meaningfully a subset operation?

    py> Counter(a="eggs") \< Counter(a="spam") True

    I supported adding these operations to Counter, but the last thing I want is to enshrine a set of ad-hoc and inconsistent semantics that don't match standard multiset behaviour and are defined by the implementation, instead of having the implementation defined by the desired semantics.

    pitrou commented 10 years ago

    Le 04/10/2014 17:24, Steven D'Aprano a écrit :

    I know that in general discussions do not need a PEP, but we seem to be rushing awfully fast into writing code before we even know what the code is supposed to do.

    Writing a patch is useful practice because it's a way to discover latent problems (especially when writing unit tests).

    py> Counter(a=1, b=0) \< Counter(a=2) True py> Counter(a=1, b=0) \<= Counter(a=2) False

    That looks wrong to me.

    py> Counter(a=0) \<= Counter(b=0) True py> (Counter(a=0) \< Counter(b=0)) or (Counter(a=0) == Counter(b=0)) False

    Ditto.

    py> Counter(a=0) \< Counter(b=0) False py> Counter(a=0) \< Counter(b=0, c=0) True

    Yuck.

    I'm not convinced that mixed Counter and regular dict comparisons should be supported, but Ram's patch supports any Mapping type. Is that a good idea? There's been no discussion on that question.

    That sounds wrong to me as well.

    Can somebody explain to me what this means? How is this meaningfully a subset operation?

    py> Counter(a="eggs") \< Counter(a="spam") True

    I agree this looks silly. However, the Counter doc says:

    """The multiset methods are designed only for use cases with positive values. [...] There are no type restrictions, but the value type needs to support addition, subtraction, and comparison."""

    Which really sounds like an excuse to not perform any type checking and silently allow nonsensical multiset behaviour with non-integer keys. We can follow that lead :-)

    06547701-06a2-4e4a-b672-16a33475101e commented 10 years ago

    Hi everybody,

    I agree we have a mess on our hands, with lots of cases where comparison is weird. To be honest I've given up on collections.Counter at this point. The fact that it's a "hybrid mapping/multiset" is the cause of all the problems, so I just made my own class that is defined as a multiset with only positive integer values, and I'm going to use that and define methods on it without having to worry about these weird cases. Unless someone else wants to champion this issue and resolve the edge cases, it can be closed as far as I'm concerned.

    I really wish that collections.Counter wouldn't have been added to the standard library as a "hybrid mapping/multiset"; I think that it shows emphasis on the implementation rather than the usage, while the usage is more important. But there's nothing I can do about that. (If there was a place where we can make notes for Python 4, I'd put a note there about it.)

    ethanfurman commented 10 years ago

    We are not "rushing into code", we are getting some code, and tests, written to help define what we are talking about.

    So far the feedback has given some more tests to add, and some things to change (such as only comparing with other Counters).

    The big question I have is how do we handle \<= and >= ? Normal Counter == treats a key with a zero value as being unequeal to a Counter without that same key, but I don't think that makes sense for partial ordering.

    serhiy-storchaka commented 10 years ago

    This implementation obeys more set properties.

    rhettinger commented 10 years ago

    I suggest implementing Counter.__lt__ which will be a partial order, similarly to set.__lt__.

    This would be reasonable if the counter were a strict stand-alone multiset or bag; however, the core design is much looser than that (supporting negative counts for example).

    It would be more accurate to think about the Counter as a dict that returns 0 for missing keys and has few extra methods that can be handy in the content of counting things.

    Accordingly, a partial ordering operator wouldn't be well-defined for some of the intended use cases.

    The addition of subset/superset operations was considered and rejected during the original design phase (the omission was intentional). It is better to leave this to the user to decide what they want to do and how they intend to do so. After all, it is as easily to work with as a regular dict. You can access it directly or subclass it to fit your own needs.

    The starting point for this class (when first proposed and endorsed by Guido) was:

       class Counter(dict):
           def __missing__(self, key):
               return 0

    That core idea is dirt simple and it is not hard to build your own variants if the one in collections doesn't meet your needs.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Hi everyone,

    Just wanted to follow up here that I've created my own dedicated bag class.

    Documentation: https://combi.readthedocs.org/en/stable/bags.html

    Install using pip install combi

    It's packed with features. There are a ton of arithmetic operations defined and the comparisons methods we talked about in this issue. There are also frozen and ordered variants.

    6fd3d553-ff50-40bd-a9a1-836658cb1a33 commented 9 years ago

    I can't believe this issue was closed. Why can't Counter.__lt__(self, other) just be all(self[i] \< other[i] for i in self)?

    Just because Counter supports weird stuff shouldn't bill it out. To follow that logic, we should also remove Counter.subtract

    >>> Counter(a='b') - Counter(a='c')
    Traceback (most recent call last):
      File "<ipython-input-5-31b2df7f8ff1>", line 1, in <module>
        Counter(a='b') - Counter(a='c')
      File "/Users/aaronmeurer/anaconda/lib/python3.5/collections/__init__.py", line 709, in __sub__
        newcount = count - other[elem]
    TypeError: unsupported operand type(s) for -: 'str' and 'str'

    It's super annoying that Counter supports all the basic set operations *except subset. A reminder that this *does work:

    >>> Counter(a=2) | Counter(a=1)
    Counter({'a': 2})

    And also fails on weird values:

    >>> Counter(a='b') | Counter(a='c')
    Traceback (most recent call last):
      File "<ipython-input-8-f9768ad92117>", line 1, in <module>
        Counter(a='b') | Counter(a='c')
      File "/Users/aaronmeurer/anaconda/lib/python3.5/collections/__init__.py", line 730, in __or__
        if newcount > 0:
    TypeError: unorderable types: str() > int()