python / cpython

The Python programming language
https://www.python.org
Other
63.62k stars 30.48k forks source link

memoryview is a Sequence but does not implement the full Sequence API #125420

Open JelleZijlstra opened 1 month ago

JelleZijlstra commented 1 month ago

Bug report

Bug description:

The memoryview builtin is registered as a Sequence, but doesn't implement the full API, because it doesn't inherit from Sequence at runtime and doesn't implement the mixin methods.

>>> import collections.abc
>>> issubclass(memoryview, collections.abc.Sequence)
True
>>> collections.abc.Sequence.index
<function Sequence.index at 0x10148d250>
>>> memoryview.index
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    memoryview.index
AttributeError: type object 'memoryview' has no attribute 'index'

Among the methods listed for Sequence in the documentation, memoryview is missing __contains__, __reversed__, index, and count. This is causing problems for typing; in the typeshed stubs we have to either lie one way by claiming that memoryview has methods it doesn't have, or lie another way by claiming it is not a Sequence when issubclass() says otherwise at runtime (python/typeshed#12800). To fix this, we should either make memoryview not a Sequence, or add the missing methods. The former has compatibility implications, so I propose to add the methods in 3.14.

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Linked PRs

JelleZijlstra commented 1 month ago

cc @rhettinger for ABCs and @AlexWaygood for prior discussion on typeshed.

picnixz commented 1 month ago

I think it would be nice to have such addition. If we were to add them in 3.14, I'd like to help! (at least with one method; I think it's better if we don't add them all in one go).

JelleZijlstra commented 1 month ago

Go ahead, I haven't started coding yet.

ZeroIntensity commented 1 month ago

IIRC, the sequence protocol on memoryview works in the C API. You guys should be able to just use those functions to implement the methods.

picnixz commented 1 month ago

By the way:

>>> reversed(memoryview(b'a'))
<reversed object at 0x7f6ece3fd8b0>

works and we actually have tests for that. However, I think there is somewhere an implicit cast or a generic reversed that is being used via indices + length maybe? So I don't know whether we should have a dedicated reversed method or just dispatch to the generic implementation.

JelleZijlstra commented 1 month ago

Yes, reversed() works through len and getitem lookup. It may be worth adding that method if it makes reversing more efficient, but I'd also be OK with skipping that one and using some hacks in the stub; after all, it's unlikely people are calling __reversed__ directly.

__contains__ is in a similar situation; in already works on memoryviews. However, there it's likely that a contains implementation written directly for memoryview is significantly more efficient.

picnixz commented 1 month ago

Well... the __contains__ I used is just by iterating over the object for now. I actually haven't checked that __contains__ worked beforehand. I can definitely make it more efficient though if performances matter (it's just that the __contains__ does not exist per se).

ZeroIntensity commented 1 month ago

I took a quick glance at the PRs and they seem fine to me, but I would be very cautious about making sure that e.g. memoryview.index does the same thing as PySequence_Index, otherwise you're bound to break untold amounts of code relying on the buffer protocol.

JelleZijlstra commented 1 month ago

That's a good point (though I'm not sure there are untold amounts of code using PySequence_Index on memoryviews). PySequence_Index delegates to _PySequence_IterSearch in abstract.c, which works similar to the implementation in Bénédikt's PRs.

picnixz commented 1 month ago

... I forgot to finish the multi-dimensional case. And the fact that it's not only basic memoryviews that should be supported but the entire protocol... I'll need to check the multi-dimensional case carefully.

you're bound to break untold amounts of code relying on the buffer protocol.

(I'll try passing on that achievement)

JelleZijlstra commented 1 month ago

It appears iteration on multidimensional memoryviews currently isn't implemented at all, so we don't need to support it for these new methods either.

picnixz commented 1 month ago

I've implemented the missing (naive) interface. The performances should be better than just the generic implementations since we have less checks but some methods can be improved by directly accessing the underlying buffer instead of creating iterators.

If such performance gain is needed, we'll address those concerns in a separate issue.

ZeroIntensity commented 1 month ago

I can look closer at the PRs sometime today. I have a few general comments at a quick glance:

picnixz commented 1 month ago

Performances should really be addressed separately. We try to avoid using the private API if this is not performance critical or if it is exposed "almost directly" to the user. Before using the private API, we need benchmarks but I think it's first better not to do it in those PRs. We will definitely gain performances by directly accessing the buffers but those should be investigated.

For assertions, which place do you have in in mind? I can find some in the __reversed__ implementation since I'm making assumption on the inputs, so I can update it later but otherwise I can't find other places where assertions need to be required.

ZeroIntensity commented 1 month ago

When writing C code, I like to put assertions everywhere and anywhere. They add no overhead at runtime (they get compiled out on release builds) and make debugging a lot easier. Generally, if you make an assumption, add an assertion.

JelleZijlstra commented 1 month ago

The performances should be better

That's a dangerous statement; it would be good to run some microbenchmarks on operations like x in memoryview(), reversed(memoryview()).

I am going to wait a bit before merging any of the PRs here in case other core devs want to voice their opinion on whether this is a good idea.

rhettinger commented 1 month ago

I'm conflicted about this one.

On the plus side, this makes the APIs more consistent and will simplify the typeshed stubs.

On the minus side, it feels a bit off to implement (and have to test and maintain) an actual concrete implementation for functionality that no user has ever needed. It seems a little like the tail wagging the dog. In the past, we've usually applied a principle of parsimony and resisted API expansions until good use cases have arisen. That has helped avoid cruft accumulating for features than never get used.

picnixz commented 1 month ago

On the plus side, we could also benefit from possible improvements though that's just a gut feeling.

For instance materializing the reverse iterator is now slower but iterating over it is faster if the buffer is large enough.

This could also serve for the multidimensional case but I have no idea if this is a feature that we plan to have or not. The APIs were simple enough to implement so it didn't bother me btw.

JelleZijlstra commented 1 month ago

I feel that we should add count and index. memoryview is registered as a Sequence, and Sequence is documented as having those methods, so to me it seems like a bug that the methods don't exist on memoryview.

I don't feel as strongly about the missing dunders, __contains__ and __reversed__. Adding them makes the typing a bit more tidy, but we could manage without them. Adding those methods may produce performance wins, but there probably isn't a lot of real code that relies on reversing memoryviews or doing containment checks on them.

picnixz commented 1 month ago

Yeah index and count seem important since they are public and documented. For contains and reversed I think we can just rely on the default dunders.

I'm actually wondering but does "being a sequence" means "having a __contains__ method" or does it mean "x in seq" works? or is it not detailed on purpose?