python / typeshed

Collection of library stubs for Python, with static types
Other
4.35k stars 1.74k forks source link

Many functions that accept Iterable should actually accept Iterable | SupportsGetItem #7813

Closed matangover closed 2 years ago

matangover commented 2 years ago

Many functions in stdlib that are documented to accept an "iterable" are annotated as Iterable[T]. This means the arguments must have __iter__. However, the actual functions don't require the arguments to have __iter__, they also work with arguments that have only __getitem__. That's because internally they use iter (in Python) or PySequence_Fast (in C).

As the docs for both iter and PySequence_Fast functions show, these functions support two protocols for the argument: the "iterable" protocol (__iter__) and the "sequence" protocol (__getitem__).

Example functions that have this bug right now:

enumerate
iter
str.join
bytes.__new__
bytes.join

For enumerate, a bug was actually filed on the mypy repo but in fact I believe this should be corrected in typeshed.

Aside: for loops also similarly accept "iterables" that have only __getitem__, but that is implemented inside type checkers themselves. For now mypy still doesn't support it properly, but Pyright does.

hauntsaninja commented 2 years ago

Related https://github.com/python/mypy/issues/2220

matangover commented 2 years ago

I suggest adding to _typeshed an alias AnyIterable: TypeAlias = Union[Iterable | SupportsGetItem]

matangover commented 2 years ago

Or, well, since SupportsGetItem has two type parameters that won't exactly work, but something along those lines.

JelleZijlstra commented 2 years ago

This affects every single place where Iterable is accepted as an argument, except in the unlikely case that the code directly calls .__iter__.

We could add an alias like you suggest and use it everywhere, but that doesn't seem like a great solution. The alias wouldn't get used in code outside typeshed, it would be harder for users to pick up this special case, and type checker error messages may get worse.

matangover commented 2 years ago

The alias wouldn't get used in code outside typeshed, it would be harder for users to pick up this special case

I don't think it's common to call iter or PySequence_Fast on function arguments in "user" code... At least I never do that in my code. Do you suggest to add a new type to typing_extensions or typing?

type checker error messages may get worse

True. But in a minor way. It would probably only change from showing "Iterable" to "SequenceOrIterable". Which is in fact more correct and less misleading.

matangover commented 2 years ago

In the end, the job of typeshed is to correctly annotate the stdlib

JelleZijlstra commented 2 years ago

I don't think it's common to call iter or PySequence_Fast on function arguments in "user" code... At least I never do that in my code. Do you suggest to add a new type to typing_extensions or typing?

Anything useful you do with an iterable will ultimately call one of those functions (or some other part of the Python implementation that supports the __getitem__-based protocol).

matangover commented 2 years ago

Right. So maybe indeed the right thing is to add it to typing or collections.abc. Maybe we could fix typeshed's annotations first, and later if a new type is introduced to Python it can be adopted.

TeamSpen210 commented 2 years ago

I don't think typeshed needs changes here, the problem is that Iterable has special additional behaviour not representable in the stub. It'd need to be handled specially in the type checker.

matangover commented 2 years ago

I disagree, the type annotations are wrong. They specify the argument has to have __iter__ but it's not true. This is really easy to fix in typeshed.

matangover commented 2 years ago

special additional behaviour not representable in the stub

Wait what behavior do you mean? Apart from having __getitem__?

JelleZijlstra commented 2 years ago

The behavior is representable in a stub. Iterable would be defined something like this:

class _IterIterable(Protocol[_T_co]):
    def __iter__(self) -> Iterator[_T_co]: ...
class _GetItemIterable(Protocol[_T_co]):
    def __getitem__(self, __idx: int) -> _T_co: ...
Iterable: TypeAlias = _IterIterable[_T_co] | _GetItemIterable[_T_co]

But I still don't like this change much. Suppose we add a typing_extensions.RealIterable defined like the above. And then:

And all that for an edge case so rare that it's only been reported a handful of times in mypy's years of history.

TeamSpen210 commented 2 years ago

Even that's not exactly correct, since it doesn't do the __iter__ = None behaviour, and since unions aren't ordered this is ambiguous with conflicting getitem/iter definitions - mappings for instance. It's just another case where magic methods have unusual behaviour.

matangover commented 2 years ago

And all that for an edge case so rare that it's only been reported a handful of times in mypy's years of history.

I ran into it with torch.utils.data.Dataset, which implements __getitem__ but not __iter__. It works great with stdlib functions at runtime but generates lots of false type errors. A lot of people use this class.

matangover commented 2 years ago

I really wouldn't mind if the fix is in type checkers instead of in typeshed. But I do think there should be a fix :) Personally I think that having a special case for Iterable in all type checkers is an "even worse" solution.

@erictraut maybe you could have an opinion or suggestion here

erictraut commented 2 years ago

The fix here cannot (or rather should not) be provided by type checkers. If this problem needs to be fixed, it should be done so in the stubs.

I also question whether this is a problem that needs fixing. As Jelle indicated, this seems like an extreme edge case that rarely comes up. Perhaps we'd be better off documenting a workaround?

In addition to the downsides that Jelle mentioned, I'll add one more: replacing Iterable with a union of two generic protocol types will likely have a measurable impact on type checking performance for many code bases. This is such a core type that is used in so many common stdlib methods, and forcing type checkers to evaluate two protocols instead of one will likely add to analysis times.

matangover commented 2 years ago

Thanks for weighing in, good points

Perhaps we'd be better off documenting a workaround?

Do you mean implementing __iter__? (i.e. in this case, the developers of PyTorch adding Dataset.__iter__)

erictraut commented 2 years ago

Do you mean implementing __iter__?

Or wrapping it in a call to the list constructor. I realize this has a bit of runtime overhead, but in most cases where the number of elements is small, that's a reasonable workaround.

matangover commented 2 years ago

Or wrapping it in a call to the list constructor

Definitely not an option for Dataset

hauntsaninja commented 2 years ago

I opened a PR against torch that might help :-) https://github.com/pytorch/pytorch/pull/77116

matangover commented 2 years ago

Here's another possible workaround, assuming I can't modify original library's code (as @hauntsaninja did), and I want to have a function accepting any iterable (e.g. pytorch Dataset and other "normal" iterables), with only a thin wrapper without coercing to list:


from typing import Generic, Iterable, Iterator, Protocol, TypeVar, Union
from typing_extensions import TypeAlias

T_co = TypeVar("T_co", covariant=True)
class GetItemIterable(Protocol[T_co]):
    def __getitem__(self, __idx: int) -> T_co: ...
AnyIterable: TypeAlias = Union[Iterable[T_co], GetItemIterable[T_co]]

class GetItemIterableWrapper(Generic[T_co]):
    def __init__(self, iterable: GetItemIterable[T_co]):
        self.iterable = iterable
    def __iter__(self) -> Iterator[T_co]:
        return iter(self.iterable) # type: ignore

def to_iterable(x: AnyIterable[T_co]) -> Iterable[T_co]:
    return x if isinstance(x, Iterable) else GetItemIterableWrapper(x)

# Example use:

class MyIterable:
    def __getitem__(self, idx: int) -> str:
        if idx >= 3:
            raise IndexError()
        return "blah"

def strjoin1(sep: str, iterable: AnyIterable[str]):
    # The following line gives type error (but works at runtime)
    return sep.join(iterable)

def strjoin2(sep: str, iterable: AnyIterable[str]):
    # The following line gives no error
    return sep.join(to_iterable(iterable))

print("Testing __iter__ iterable:")
print(strjoin1(" ", ["1", "2", "3"]))
print(strjoin2(" ", ["1", "2", "3"]))

print("\nTesting __getitem__ iterable:")
print(strjoin1(" ", MyIterable()))
print(strjoin2(" ", MyIterable()))
hauntsaninja commented 2 years ago

Hmm, I guess I would be in favour of a single typeshed change: allow iter to accept a GetItemIterable. Then you could use iter(MyIterable()) in places that expect an Iterator. Hopefully in most situations this is most of what you need and you can get away without the friction of defining your own protocols and helpers as above.

matangover commented 2 years ago

Yes, that would remove the need for type: ignore in my example. The rest would still be needed.

matangover commented 2 years ago

Oh, you are totally right, it isn't! Because an iterator is also iterable...

hauntsaninja commented 2 years ago

Hopefully https://github.com/python/typeshed/pull/7817 will go some way towards making things more ergonomic :-)

AlexWaygood commented 2 years ago

Thanks for the report, and I agree that this is one of 9857 things that are slightly annoying about Python's typing system. I think I speak for my fellow maintainers, though, when I say that we're not going to make the change you're asking for here. While it's true that objects with __getitem__ but not __iter__ do support iteration, I think that should properly be viewed as an accident of history rather than something we should bend over backwards to account for in typeshed. If a class is meant to be iterable, it should generally define __iter__.

I'm glad @hauntsaninja's ingenious #7817 was merged, and I hope his pytorch PR is also merged :)

matangover commented 2 years ago

Thanks all for your insightful comments, I totally agree with what you said @AlexWaygood. I think the change applied by @hauntsaninja is an excellent workaround. One thing that could be done is to document this somewhere, but seems like there is no real "Python typing docs" place, only the type checkers document this individually

JelleZijlstra commented 2 years ago

We could put up an essay on typing.readthedocs.io, similar to https://typing.readthedocs.io/en/latest/source/unreachable.html about exhaustiveness checking.