pandas-dev / pandas-stubs

Public type stubs for pandas
BSD 3-Clause "New" or "Revised" License

DataFrame initializer does not accept a Generator of a List as data #324

Closed gandhis1 closed 2 years ago

gandhis1 commented 2 years ago

A two-dimensional array (list of lists) is an acceptable value for the data argument of the pd.DataFrame initializer, and passing one type-checks fine. However, when the outer dimension is a Generator rather than a List, the type checker rejects the call. It seems ListLikeU should really accept any Iterable (noting that this would then overlap somewhat with the Iterable[Tuple] annotation).

Example:

from typing import Any, List
import pandas as pd

data: List[List[Any]] = [
    [1, 2.0, "something", pd.Timestamp("1/1/2022")],
    [2, 4.0, "other", pd.Timestamp("7/4/2022")]
]
data_gen = (x for x in data)
df = pd.DataFrame(data=data_gen)

Note that the actual element types here are irrelevant, which is why I manually annotated them as Any. The error remains even when you remove this annotation.

Output:

error: Argument "data" to "DataFrame" has incompatible type "Generator[List[Any], None, None]"; expected "Union[Union[Sequence[Any], ndarray[Any, Any], Series[Any], Index], DataFrame, Dict[Any, Any], Iterable[Tuple[Hashable, Union[Sequence[Any], ndarray[Any, Any], Series[Any], Index]]], None]" [arg-type]

bashtage commented 2 years ago

Is a generator really acceptable? I would think that pandas might want to require len on both the inner and outer containers.

gandhis1 commented 2 years ago

The documentation says it supports an Iterable, and isn't a Generator an Iterable? And besides, they don't currently call len (the above code example works).

Dr-Irv commented 2 years ago

You could try adding Iterable to ListLikeU. The only concern is that there might be Iterable objects that would not be acceptable to pandas, and that could be really hard to test.

In other words, let's suppose there is a class that is Iterable, and a user writes code to pass an instance of that class to the DataFrame constructor. It might be the case that pandas would fail using that class, but it would still pass the type checker. That means the type Iterable is too wide for the constructor.

I'm not sure that will happen here, but it is something to be cognizant of, as our testing methodology doesn't pick up cases where the types are too wide.

bashtage commented 2 years ago

Roughly speaking, a generator is used when something iterates and the end condition isn't known. An iterable is when you know how many times the loop will run.

twoertwein commented 2 years ago

A Generator is an Iterator and also an Iterable: https://github.com/python/typeshed/blob/66751e2ebfed2540715426b3d6b2ffb8c8e16b57/stdlib/typing.pyi#L356
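The subclass relationship in the typeshed link can also be checked at runtime with the standard library's abstract base classes. This is a minimal sketch, not pandas code; the names rows and rows_gen are made up for illustration. It also shows that a generator has no len, which is relevant to the earlier question about whether pandas needs len on the outer container.

```python
from collections.abc import Generator, Iterable, Iterator, Sized

# Illustrative data only; the variable names are hypothetical.
rows = [[1, 2.0, "something"], [2, 4.0, "other"]]
rows_gen = (row for row in rows)

# A generator object satisfies all three abstract base classes.
is_generator = isinstance(rows_gen, Generator)
is_iterator = isinstance(rows_gen, Iterator)
is_iterable = isinstance(rows_gen, Iterable)

# A generator does not define __len__, so it is not Sized.
has_len = isinstance(rows_gen, Sized)

# Consuming the generator yields the original inner lists in order.
materialized = list(rows_gen)
```

So any annotation that accepts Iterable also accepts a Generator, but code that calls len on the outer container would fail on one.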

Pandas allows almost all iterables for pd.DataFrame(data) but has some prominent exclusions: str, bytes, and sets (excluded in is_list_like).

I think it would definitely be safe to allow Generator for pd.DataFrame, as it doesn't include str/bytes/sets.
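To illustrate why a bare Iterable annotation would be wider than Generator, here is a small standard-library sketch. It only inspects abstract base classes and makes no claim about how pandas or pandas-stubs implement their checks; the candidates dictionary is a hypothetical set of sample values.

```python
from collections.abc import Generator, Iterable

# Hypothetical sample values, one per type mentioned in the thread.
candidates = {
    "str": "abc",
    "bytes": b"abc",
    "set": {1, 2},
    "generator": (row for row in [[1, 2], [3, 4]]),
}

# Every candidate is an Iterable, so an Iterable annotation would admit all of them.
iterable_flags = {name: isinstance(value, Iterable) for name, value in candidates.items()}

# Only the generator expression is a Generator, so that annotation stays narrow.
generator_flags = {name: isinstance(value, Generator) for name, value in candidates.items()}
```

Under these checks, Iterable matches str, bytes, and set as well as the generator, while Generator matches only the generator.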