tradewelltech / protarrow

Convert from protobuf to arrow and back
https://protarrow.readthedocs.io/
Apache License 2.0
17 stars 3 forks source link

recursive/self-referential types leads to a recursion error during schema generation #85

Open alkasm opened 2 days ago

alkasm commented 2 days ago

Currently it isn't possible to use protarrow with a protobuf that is self-referential. For example, neither of these two protos will work:

message LinkedListNode {
  string value = 1;
  LinkedListNode next = 2;
}

message TreeNode {
  string value = 1;
  repeated TreeNode children = 2;
}

The recursive field getter unsurprisingly hits an infinite recursion error when creating the schemas.

My understanding is that parquet does not support this type of self-referential column either.

I think the common way to deal with converting recursive nested structures into a flat schema is to introduce a "max depth", where the schema will drop the field after N levels, which could be parametrizable. Does this seem like a reasonable idea to you?

0x26res commented 2 days ago

I don't think we'd be able to support this. Protarrow can only work with types that are supported by both Apache Arrow and Protobuf.

Your suggestion to use a fixed max depth would create very wide table Arrow tables, that would be very hard to use in practice. I think it would make more sense to have a schema where each Node points to the ID of other nodes, rather than their value.

alkasm commented 1 day ago

Your suggestion to use a fixed max depth would create very wide table Arrow tables, that would be very hard to use in practice.

Agreed if you wanted max depth to be large, but I think in practice this is usually something way more reasonable, like max_depth=3. I guess the example I gave of linked lists and trees is not the best since those structures are probably way more deep generally. But I think in practice at the API level, and with the actual edge case that I hit, a low number is likely acceptable.

I think it would make more sense to have a schema where each Node points to the ID of other nodes, rather than their value.

I'm not sure if you're suggesting this is something protarrow could do to solve the problem, or if you're saying the protobuf schema should change? From my experience most protobuf -> arrow/parquet use-cases are going to be ones that are far downstream of the definition of the protos.

Here's a related issue on apache parquet-java btw: https://github.com/apache/parquet-java/issues/1642