Open alkasm opened 2 days ago
It seems like the mask parameter is optional from the pyarrow docs: https://arrow.apache.org/docs/python/generated/pyarrow.StructArray.html#pyarrow.StructArray.from_arrays and the default is None: https://github.com/apache/arrow/blob/ea9b15ff941e7492e171cffee05af85b99306631/python/pyarrow/array.pxi#L4021 so probably can just use None instead of pa.repeat(False, len(messages)) either way?
But glancing at some of the implementation with the iterable getters, it's not clear to me if a general Iterable[M] is actually OK, or if the library specifically needs the whole thing in memory (to iterate over the messages multiple times).
Edit: Yeah, from what I can tell you need to pass the whole list in memory, so it actually needs to be a Sequence[M].
Hi @alkasm, thanks for creating this issue, I think there's a few things here:
- We don't need __getitem__ / random access, we only iterate through the data.
- If we change the call to pa.StructArray.from_arrays to pass None when there's no mask, then we should get around the requirements of having a Sized and an Iterable.
hmm I don't think Iterable implies you can iterate over it many times - iterators and generators, including infinite and non-re-entrant ones, are also Iterable.
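A small sketch of that distinction, using only the standard library: a list and a generator both pass an Iterable check, but only the list survives a second pass. The helper name iterate_twice is hypothetical and is not part of protarrow.

```python
from collections.abc import Iterable, Iterator


def iterate_twice(items: Iterable[int]) -> tuple[list[int], list[int]]:
    """Hypothetical helper: walk the same input two times."""
    first_pass = list(items)
    second_pass = list(items)
    return first_pass, second_pass


numbers = [1, 2, 3]
generator = (n for n in numbers)

# Both objects satisfy Iterable, so an Iterable annotation accepts either one.
assert isinstance(numbers, Iterable)
assert isinstance(generator, Iterable)

# Only the generator is also an Iterator, which is consumed as it is read.
assert not isinstance(numbers, Iterator)
assert isinstance(generator, Iterator)

list_result = iterate_twice(numbers)
generator_result = iterate_twice(generator)
print(list_result)
print(generator_result)
```

The second pass over the generator comes back empty, which is the failure mode a multi-pass consumer would hit silently.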
Agreed though that Sequence has more restrictions than necessary since you don't need __getitem__. Collection is probably closest (you don't need __contains__ but at least it won't allow an Iterator). Here's a very relevant discussion from typing asking for an Iterable that is not an Iterator: https://github.com/python/typing/issues/1319
Guido's suggestion is just to use a more concrete type like we've mentioned here.
I guess there isn't a type hint that works for what we need in protarrow.
Technically the library only calls __iter__ on the input. And it calls it multiple times.
It only calls __len__ in a very narrow use case. If you have a list of google.protobuf.empty_pb2.Empty (or any custom empty message). For any other use case __len__ should not be called.
I went ahead and narrowed the use of __len__ to this specific case and ignored the type check when the call to __len__ happens: https://github.com/tradewelltech/protarrow/pull/82
We could change from Iterable to Collection, but in this case the type hint would say that we can't use the library with KeysView and ValuesView, which we can.
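For reference, a quick runtime check of which collections.abc ABCs a few candidate inputs satisfy. This is only a sketch with made-up sample values. On CPython the dict views do appear to register as Collection, so the concern about KeysView and ValuesView may be worth re-checking against a type checker.

```python
from collections.abc import Collection, Iterable, Iterator, Sequence, Sized

# Sample inputs chosen for illustration only.
samples = {
    "list": [1, 2],
    "generator": (n for n in [1, 2]),
    "dict keys view": {"a": 1}.keys(),
    "dict values view": {"a": 1}.values(),
}

abcs = (Iterable, Sized, Collection, Sequence, Iterator)

# Build a table of isinstance() results, one row per sample.
report = {}
for name, value in samples.items():
    row = {}
    for abc in abcs:
        row[abc.__name__] = isinstance(value, abc)
    report[name] = row
    print(name, row)
```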
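I'm not sure what's best when putting type hints on the input:
- be too restrictive, ie requiring Collection
- not be restrictive enough, ie not documenting the fact that in some very narrow use case, we will call __len__ on the input
But maybe the bigger problem is that the library doesn't make it clear that the input has to fit into memory already.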
| It only calls __len__ in a very narrow use case. If you have a list of google.protobuf.empty_pb2.Empty (or any custom empty message). For any other use case __len__ should not be called.
Hm I'm not sure if that's true? I am not using an empty message and I hit this case. One potentially important point with that---I was using dynamic protobufs, i.e. types created at runtime from the file descriptors.
| I'm not sure what's best when putting type hints on the input:
| - be too restrictive, ie requiring Collection
| - not be restrictive enough, ie not documenting the fact that in some very narrow use case, we will call __len__ on the input
Since this is for type checking specifically (and doesn't prevent passing e.g. KeysView at runtime) I think the general practice is to be more conservative - if my code type checks, I expect it to not fail at runtime. But if I know it actually has an expanded capability at runtime, I can always add # type: ignore.
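A minimal sketch of the conservative option, assuming a function that needs both len() and more than one pass over its input. The name summarize is hypothetical and only stands in for a converter. It is not protarrow's API.

```python
from collections.abc import Collection
from typing import TypeVar

M = TypeVar("M")


def summarize(messages: Collection[M]) -> tuple[int, list[M], list[M]]:
    """Hypothetical stand-in for a converter that needs len() and two passes."""
    size = len(messages)          # Collection guarantees __len__
    first_pass = list(messages)   # Collection guarantees __iter__
    second_pass = list(messages)  # second pass; plain generators are rejected by the annotation
    return size, first_pass, second_pass


views = {"a": 1, "b": 2}
result_from_list = summarize([1, 2])
result_from_view = summarize(views.keys())
print(result_from_list)
print(result_from_view)
```

With this annotation a type checker would flag a generator argument before it fails at runtime, while lists, tuples, sets and dict views still pass.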
| Hm I'm not sure if that's true? I am not using an empty message and I hit this case. One potentially important point with that---I was using dynamic protobufs, i.e. types created at runtime from the file descriptors.
That's true with the latest code which hasn't been released yet. https://github.com/tradewelltech/protarrow/blob/0458ba6dca84ea37becccbf7c8197b658c9971b6/protarrow/proto_to_arrow.py#L527
Iterables can be infinite/don't necessarily have a length, so this line of code invalidates the annotation. Length requires Sized. Could also use Collection[M] or Sequence[M] (from collections.abc), though those are a tad less generic.
Or, maybe the length can be circumvented.
Either way, was surprised to get a runtime error here - is this project type-checked in CI?
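To illustrate the kind of runtime error described above, here is a small standard-library sketch: calling len() on an input that is Iterable but not Sized raises TypeError. The helper try_len is hypothetical and only reports what happens.

```python
from collections.abc import Iterable, Sized


def try_len(items: Iterable[int]) -> str:
    """Hypothetical helper: report whether len() works on the input."""
    try:
        return f"len = {len(items)}"
    except TypeError as error:
        return f"TypeError: {error}"


# A generator is Iterable but not Sized, so the annotation accepts it
# even though len() cannot be called on it.
assert not isinstance((n for n in range(3)), Sized)

list_outcome = try_len([1, 2, 3])
generator_outcome = try_len(n for n in range(3))
print(list_outcome)
print(generator_outcome)
```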