tradewelltech / protarrow

Convert from protobuf to arrow and back
https://protarrow.readthedocs.io/
Apache License 2.0
17 stars 2 forks source link

Incorrect type annotation for `messages_to_table()`: marked to take an `Iterable[M]` but requires `len()` #81

Open alkasm opened 2 days ago

alkasm commented 2 days ago

Iterables can be infinite/don't necessary have a length, so this line of code invalidates the annotation. Length requires Sized. Could also use Collection[M] or Sequence[M] though those are a tad less generic.

collections.abc.Collection

ABC for sized iterable container classes.

Or, maybe the length can be circumvented.

Either way, was surprised to get a runtime error here - is this project type-checked in CI?

alkasm commented 2 days ago

It seems like the mask parameter is optional from the pyarrow docs: https://arrow.apache.org/docs/python/generated/pyarrow.StructArray.html#pyarrow.StructArray.from_arrays

and the default is None: https://github.com/apache/arrow/blob/ea9b15ff941e7492e171cffee05af85b99306631/python/pyarrow/array.pxi#L4021

so probably can just use None instead of pa.repeat(False, len(messages)) either way?

alkasm commented 2 days ago

But glancing at some of the implementation with the iterable getters, it's not clear to me if a general Iterable[M] is actually OK, or if the library specifically needs the whole thing in memory (to iterate over the messages multiple times).

Edit: Yeah from what I can tell you need to pass the whole list in memory so it actually needs to be a Sequence[M]

0x26res commented 1 day ago

Hi @alkasm, thanks for creating this issue, I think there's a few things here:

alkasm commented 1 day ago

hmm I don't think Iterable implies you can iterate over it many times - iterators and generators, including infinite and non-re-entrant ones, are also Iterable.

Agreed though that Sequence has more restrictions than necessary since you don't need __getitem__. Collection is probably closest (you don't need __contains__ but at least it won't allow an Iterator). Here's a very relevant discussion from typing asking for an Iterable that is not an Iterator: https://github.com/python/typing/issues/1319

Guido's suggestion is just to use a more concrete type like we've mentioned here.

0x26res commented 1 day ago

I guess there isn't a type hint that works for what we need in protarrow.

Technically the library only calls __iter__ on the input. And it calls it multiple times.

It only calls __len__ in a very narrow use case. If you have a list of google.protobuf.empty_pb2.Empty (or any custom empty message). For any other use case __len__ should not be called.

I went ahead and narrowed the use of __len__ to this specific case and ignored the type check when the call to __len__ happens https://github.com/tradewelltech/protarrow/pull/82.

We could change from Iterator to Collection, but in this case the type hint would say that we can't use the library with KeysView and ValueView, which we can.

I'm not sure what's best when putting type hints on the input:

But maybe the bigger problem is that the library doesn't make it clear that the input has to fit into memory already.

alkasm commented 1 day ago

It only calls len in a very narrow use case. If you have a list of google.protobuf.empty_pb2.Empty (or any custom empty message). For any other use case len should not be called.

Hm I'm not sure if that's true? I am not using an empty message and I hit this case. One potentially important point with that---I was using dynamic protobufs, i.e. types created at runtime from the file descriptors.

I'm not sure what's best when putting type hints on the input:

  • be too restrictive, ie requiring Collection
  • not be restrictive enough, ie not documenting the fact that in some very narrow use case, we will call len on the input

Since this is for type checking specifically (and doesn't prevent passing e.g. KeysView at runtime) I think the general practice is to be more conservative - if my code type checks, I expect it to not fail at runtime. But if I know it actually has an expanded capability at runtime, I can always # type: ignore

0x26res commented 21 hours ago

| Hm I'm not sure if that's true? I am not using an empty message and I hit this case. One potentially important point with that---I was using dynamic protobufs, i.e. types created at runtime from the file descriptors.

That's true with the latest code which hasn't been released yet. https://github.com/tradewelltech/protarrow/blob/0458ba6dca84ea37becccbf7c8197b658c9971b6/protarrow/proto_to_arrow.py#L527