python / cpython

The Python programming language
https://www.python.org
Other
63.51k stars 30.42k forks source link

PyMapping_Check returns 1 for list #90534

Closed 744d82b5-30ef-4294-8536-1bb4ff002786 closed 2 years ago

744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago
BPO 46376
Nosy @birkenfeld, @rhettinger, @pitrou, @benjaminp, @encukou, @methane, @bukzor, @aganders3, @serhiy-storchaka, @ilevkivskyi, @bharel, @miss-islington, @brandtbucher, @aviramha, @bentheiii
PRs
  • python/cpython#30600
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['expert-C-API', 'type-feature', '3.11'] title = 'PyMapping_Check returns 1 for list' updated_at = user = 'https://github.com/aviramha' ``` bugs.python.org fields: ```python activity = actor = 'avrahami.ben' assignee = 'none' closed = False closed_date = None closer = None components = ['C API'] creation = creator = 'aviramha' dependencies = [] files = [] hgrepos = [] issue_num = 46376 keywords = ['patch'] message_count = 23.0 messages = ['410555', '410556', '410557', '410564', '410566', '410585', '410618', '410619', '410621', '410622', '410624', '410626', '410627', '410629', '410632', '410633', '410944', '410949', '410951', '411363', '411387', '411422', '411568'] nosy_count = 15.0 nosy_names = ['georg.brandl', 'rhettinger', 'pitrou', 'benjamin.peterson', 'petr.viktorin', 'methane', 'bukzor', 'aganders3', 'serhiy.storchaka', 'levkivskyi', 'bar.harel', 'miss-islington', 'brandtbucher', 'aviramha', 'avrahami.ben'] pr_nums = ['30600'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue46376' versions = ['Python 3.11'] ```

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    This is re-open of https://bugs.python.org/issue5945. In the former issue, it was decided that documenting the odd behavior and later providing clean, good ABC C API would be the long term solution. Few years passed and there are no alternatives AFAIK I am wondering if that's due to lack of resources or just forgotten? I don't mind contributing the necessary change in case there's really nothing in progress and it is something Python community wants to fix. Hopefully there's an existing solution and I just didn't search well enough.

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    https://github.com/PyO3/pyo3/issues/2072 Relevant discussion in PyO3 related issue

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    Would checking the TPFLAGS for Py_TPFLAGS_SEQUENCE & Py_TPFLAGS_MAPPING when using PySequence_Check & PyMapping_Check be a valid fix?

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    Up until now, trying to distinguish between actual sequences and mappings in C-API was a pain. Same methods are implemented in customer user classes, and the ABCs could have either been registered dynamically or had an appropriate __subclasshook. On top of that, making the same checks as isinstance regarding the whole ABC and __mro chain would considerably slow down the code.

    The new tp_flags are set in both registered abc classes, and subclasses of the appropriate types. As tp_flags exists, we don't even need to account for tp_as_mapping and tp_assequence. According to the docs "if such a flag bit is clear, the type fields it guards must not be accessed and must be considered to have a zero or NULL value instead.". User made classes that don't inherit from the Sequence and Mapping ABCs are not considered either, and there is no \_subclasshook__ for these ABCs so inheritance is fully enforced.

    As we cover builtins, c-extension classes and user-made Python classes, checking tp_flags is guaranteed to be accurate for distinguishing between mapping and sequence types, and would be as fast as a pointer dereference can be.

    Modifying the current PySequence_Check and PyMapping_Check might be a breaking change, especially on a C-API level. However one can argue that checking for a mapping and expecting a failure on user-made mappings counts as a bug. Having them accurately return the correct type in a fast manner using tp_flags is therefore an acceptable breaking change.

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    Do note, the relevant functions are in the Stable ABI, and their promise will slightly change, yet modifying the current functions instead of creating new ones may still be beneficial.

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    I submitted a draft patch. Using TPFlags alone doesn't cut it as some types are excluded (bytes, str, bytearray) in sequence and same for mapping. I'm thinking of checking for those cases specifically as those are very very specific casings. Would love some input.

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    Another question we should ask is about duck typing. Is a sequence which doesn't inherit from abc.Sequence considered a sequence? Whatever the answer is, PySequence specifically looks for a sequence and removes duck typing out of the picture. The object will not pass static typing and will not pass isinstance check, so there's no reason for it to pass PySequence.

    rhettinger commented 2 years ago

    It changes behavior for objects not being iterable/sequences if not inheriting from abc.collections.Sequence.

    This would be a breaking change (for example, it broke the long stable sre_parse code).

    The utility of PySequence_Check and PyMapping_Check is already so low that it isn't work breaking other things to just to marginally improve these two minor and rarely used functions.

    These functions will never be fully reliable. The documentation explains that in general, we can't tell if a class with __getitem__ is a mapping or a sequence. Sometimes hints are present (such as the tp_flags), but the can't get a reliable result. As Guido observed, "calling PyMapping_Check() was never particularly reliable, and extension modules depending on it probably always had subtle bugs." That was true in 2011 and it is still true today.

    I recommend closing this. These functions are mostly unimportant and unreliable and cannot be made correct. In contrast, iterability is important and needs to be stable. Special cases aren't important enough to break the rules.

    rhettinger commented 2 years ago

    s/it isn't work breaking other things/it isn't worth breaking other things/

    methane commented 2 years ago

    collections.abc.Mapping is fixed by https://bugs.python.org/issue43977 We can be same thing if backward compatibility allows it.

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    I thought about it, what about simply excluding TPFLAGS_MAPPING from PySequence and TPFLAGS_Sequence from PyMapping? It will remove the types that are 100% not sequences or mappings, as these flags are mutually exclusive by definition. The result will be much more accurate, yet not cause a breaking change, apart from places where it is truly not a sequence or mapping.

    rhettinger commented 2 years ago

    what about simply excluding TPFLAGS_MAPPING from PySequence and TPFLAGS_Sequence from PyMapping? It will remove the types that are 100% not sequences or mappings, as these flags are mutually exclusive by definition.

    This is more plausible than the proposed breaking change.

    The result will be much more accurate

    If they can't be made fully reliable, why would we ever recommend that someone use these functions in real code? They can be made to guess better than they guess now, but there is still guesswork.

    ISTM developers should follow the structure pattern matching implementation and refuse the temptation to guess. If a class declares itself as a mapping or sequence, that is reliable information. In contrast, these functions attempt to divine meaning in the absence of a clear declaration. Using these functions will likely result in subtle bugs.

    Once Py_TPFLAGS_MAPPING and Py_TPFLAGS_SEQUENCE became available, we should have deprecated these functions. No one should use them anymore. Their core design is flawed; they tried to deduce semantics from structural artifacts.

    rhettinger commented 2 years ago

    Here's question to focus on: In what circumstances should a developer ever prefer PyMapping_Check() over PyType_HasFeature() with Py_TPFLAGS_MAPPING?

    The latter doesn't give any new, useful information:

    return o && Py_TYPE(o)->tp_as_mapping && Py_TYPE(o)->tp_as_mapping->mp_subscript;

    I don't see any reason to build on top of this. It's best to just let it go gently into the good night without disrupting anything that currently happens to work.

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    I agree that a developer should and would prefer the Py_TPFLAGS_* but when you visit https://docs.python.org/3/c-api/sequence.html It seems like the best practice to determine Sequence protocol is by using this function, hence leading to confusion. There's no recommendation to use the new Py_TPFLAGS_*. To have this knowledge of Py_TPFLAGS_* one should be very knowledgable in Python's C-API. How about adding a deprecation note to PyMapping_Check & PySequence_Check in the documentation, suggesting the alternative path (to use PyType_HasFeature)?

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    @rhettinger I completely understand what you're saying and at first I agreed with you. Before I gave it a closer look, I thought about the same thing - we want reliability. Reliability is important and will avoid subtle bugs, which is why I was against this change for the exact reasons you mentioned: it is both breaking and unreliable.

    I then realized that this change can be a reliable replacement for isinstance(obj, collections.abc.Sequence) at the C level. Let's use the broken sre_parse.SubPattern as an example - it does not register or inherit from collections.abc.Sequence, and isinstance(SubPattern, collections.abc.Sequence) == False. We cannot know programmatically if SubPattern is a Sequence, we cannot type hint it as such, and apart from reading the documentation, we cannot deal with the type differently in dynamic code that accepts either sequences or mappings. I dare to say, counting on it being a sequence, especially on a LBYL language like C is even less reliable. While SubPattern "embraces" the spirit of duck typing, it is very hard to fit in light of all recent changes advocating for a more structured and well defined types. After all, this feature was requested in order to solve reliability issues in statically typed languages.

    Putting everything aside, the grand question still remains: do you think that there's a use for an efficient C-API isinstance check for Sequences and Mappings? I would presume the answer is yes. Would we encourage it? I have no clue. But if there's a need, we can either change this function as it has the same "spirit" or introduce a new one to prevent breaking existing code.

    To answer your question: per specification, testing for Py_TPFLAGS_SEQUENCE using PyType_HasFeature, does not take strings, bytes and bytearray into consideration, and will not suffice. It is an incorrect solution that is even less reliable and falls into the exact pitfall of "guesswork" (for example SubPattern currently doesn't work with it either). It is not encouraged or easily thought of. PySequence_Check which is much more intuitive yet doesn't work either and that's where fixing it can have an edge.

    A theoretical PyIsInstance_Sequence can check for TPFLAGS_SEQUENCE and Str/Bytes/ByteArray_Check. If I'm not wrong, doing so will be 100% reliable, identical to isinstance(obj, Sequence), and will be very efficient.

    As a side-note, the C-API documentation for TP_FLAGS is not clear atm. It mentions for example tp_as_sequence and says "if such a flag bit is clear, the type fields it guards must not be accessed and must be considered to have a zero or NULL value instead" yet Py_TPFLAGS_SEQUENCE does not actually coincide with sequences per specification. I know it has a different explanation as well and the flag has its own docstring, but it is still a bit misleading.

    9326c04b-4d37-4a35-bdd5-d40f5f681a02 commented 2 years ago

    That was true in 2011 and it is still true today

    Python's methodology greatly shifted since 2011. For the better or worse, Python lost some of its duck typing. Nowadays, most people use linters. Wherever you'd try to pass sre_parse.SubPattern, the linter will throw an error saying it's not a Sequence even if it fully behaves like one. You can silence that error but you'll continue receiving such warnings all over the code, whether in stdlib or out in the wild. The meaning of Sequence now shifted to "inherits from abc.Sequence". The only thing wrong with PySequence_Check is that the ecosystem shifted, but its view of a Sequence remained the same.

    encukou commented 2 years ago

    Changing the existing functions is a no-go for backwards compatibility reasons. I think the best way forward would be to add a new function, and then possibly deprecate the old one if it's deemed dangerous.

    If you want to push this forward, could you summarize the use case(s) and expected behavior (docs) for such a function? It probably doesn't need a PEP, but it's worth looking here for what to think about when making a proposal: https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    Sure, I will do so. The proposal should be written here, right?

    encukou commented 2 years ago

    I'd post it to capi-sig, or to the existing thread on python-dev. But here's a good place too, especially if you want feedback from a smaller group of people first.

    744d82b5-30ef-4294-8536-1bb4ff002786 commented 2 years ago

    I sent it to sig-capi - https://mail.python.org/archives/list/capi-sig@python.org/thread/T6DHEKHKKZIYU2GEPGHUQJ3DHTJXZGWW/

    serhiy-storchaka commented 2 years ago

    It is difficult to distinguish Mapping from Sequence because they have the same set of dunder methods. The difference is only in semantic -- __iter yields items for Sequence and keys for Mapping, __getitem gets an item by index (or a sequence by slice) for Sequence and value by key for Mapping. But semantic cannot be tested here.

    In these cases when both Sequence and Mapping are accepted but handled differently (like in the dict constructor) we test for existence of the "keys" method. It is not good, because it is not a dunder method, and therefore should be looked up at instance, not only at a type.

    rhettinger commented 2 years ago

    I would really like this to be left alone. We've been without a reliable version for over a decade. That is strong evidence that we really don't need this unless it can be done perfectly (which it can't).

    Also, PyMapping_Check is just a CPython specific optimization (and a flawed one). Because it can't be made completely reliable, we should not encourage people to use it, nor should we enshrine it in the stable API.

    And if a modification can potentially break long stable code (such as that it the re module), then we absolutely shouldn't do it. x

    Further, there is a general design issue. Abstract base classes were invented to solve this specific problem (distinguishing mappings from sequences). ABCs (and typing) are a now well established practice, and it is foolish to try to do an end run around using them.

    I am strongly opposed to this going forward and request that a PEP be made if it is pursued further. It is an SC level decision to allow stable code to be broken, to guarantee a CPython specific API for something that cannot be made correct in the general case, and to bypass the intended way to do it.

    If the core concern is that isinstance() checks for ABCs are too slow, then efforts should be made to optimize them rather than creating an unreliable, CPython only alternative.

    c22cdacb-6abf-45f9-b781-9862d9df9aff commented 2 years ago

    IMHO, I don't think any alternative to aviramha's solution addresses the issue, And I don't think the need is niche enough to be ignored.

    PyType_HasFeature excludes strings, bytes, and other esoteric types. PyMapping_Check includes mappings like dict and MappingProxyType. PySequence_Check includes non-dict mappings like MappingProxyType.

    The only possible solutions right now are:

    a. Import the "collections.abc.Sequence" class into C and run "PyObject_IsInstance" on it. This would be the correct solution, but it would be slower than aviramha's propsal. b. Perform's aviramha's proposal manually: first check the "Py_TPFLAGS_SEQUENCE" feature flag, and fallback for instance checking strings, bytecodes, and other esoteric types individually. This would be correct and fast, but is cumbersome to perform, and users are bound to forget some types.

    A question as simple as "would isinstance(X, \<Sequence/Mapping>) returns true?" should be easier to answer for low-level developer, one only needs to look at the finagling numpy, among others, has to perform to parse a sequence (the relevant function in numpy is called "PyArray_FromAny").

    A simple implementation of a new function for non-CPython alternatives will be to do implement solution a: import "collections.abc.Sequence", and check for instance. For CPYTHON, this can be optimized by implementing solution b.

    There is already a precedence for ABI functions that only exist because its trivial implementation can be optimized with CPython specific behaviour. I don't understand the reluctance to add this one.

    rhettinger commented 2 years ago

    Marking this as closed: