noirello / pyorc

Python module for Apache ORC file format
Apache License 2.0

handle uniontype #61

Closed hinxx closed 1 year ago

hinxx commented 1 year ago

I'm trying to use the uniontype type. Is it supported?

>>> import pyorc
>>> fp = open("./new_data-6.orc", "wb")
>>> writer1 = pyorc.Writer(fp, "struct<col1:uniontype<int,double>>")
>>> writer1.write((0, 10))
>>> writer1.write((1, 11))
>>> writer1.write((22, 0))
>>> writer1.write((33, 1))
>>> writer1.write((0, 44, 44))
>>> writer1.write((1, 55, 55))
>>> writer1.close()
>>> fp.close()

I can write data, but the contents of the file look wrong when inspected with orc-contents. Every value is written with tag 0; tag 1 never appears.

$ orc-contents new_data-6.orc 
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}
{"col1": {"tag": 0, "value": 22}}
{"col1": {"tag": 0, "value": 33}}
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}

Schema looks OK:

$ orc-metadata new_data-6.orc
{ "name": "new_data-6.orc",
  "type": "struct<col1:uniontype<int,double>>",
  "attributes": {},
...

noirello commented 1 year ago

Technically, yes: uniontype is supported, but only in a limited way.

The module tries to cast the field value to the first container type (the container types are int and double in your example). If that fails, it tries the next type, and so on, until a conversion succeeds or no candidate type remains, in which case it raises an exception.
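
A rough pure-Python model of this fallback may make the behaviour easier to picture. It is only a sketch and not pyorc's actual code: the names `pick_union_tag` and `CASTS` are hypothetical, and the acceptance rules (an int slot rejects floats, a double slot accepts both ints and floats) are assumptions inferred from the outputs shown in this thread.

```python
# Illustrative model only; this is NOT how pyorc is implemented internally.

def _as_int(value):
    # Assumption: a Python float is not accepted where an ORC int is expected.
    if isinstance(value, int):
        return value
    raise TypeError(f"{value!r} is not an int")


def _as_double(value):
    # Assumption: both Python ints and floats are accepted as an ORC double.
    if isinstance(value, (int, float)):
        return float(value)
    raise TypeError(f"{value!r} is not a number")


# Hypothetical lookup table from ORC type name to a conversion function.
CASTS = {"int": _as_int, "double": _as_double}


def pick_union_tag(value, child_types):
    """Return (tag, converted_value) for the first child type that accepts value."""
    for tag, type_name in enumerate(child_types):
        try:
            return tag, CASTS[type_name](value)
        except TypeError:
            continue  # this child type did not match, try the next one
    raise TypeError(f"{value!r} matches none of {child_types}")


for sample in (0, 1.0, 22.0, 33):
    print(sample, pick_union_tag(sample, ["int", "double"]))
```

In this model the tag is simply the index of the first child type whose conversion succeeds, which is why the declaration order of the child types matters.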

Because you only write integers in your example, every value will be an int (tag: 0, the first container type). If you use float values explicitly:

>>> fp = open("./new_data-6.orc", "wb")
>>> writer1 = pyorc.Writer(fp, "struct<col1:uniontype<int,double>>")
>>> writer1.write((0,))
>>> writer1.write((1.0,))
>>> writer1.write((22.0,))
>>> writer1.write((33,))
>>> writer1.write((0,))
>>> writer1.write((1,))
>>> writer1.close()
>>> fp.close()

Then you can see values with tag: 1:

$ orc-contents ./new_data-6.orc
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 1, "value": 1}}
{"col1": {"tag": 1, "value": 22}}
{"col1": {"tag": 0, "value": 33}}
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}

(Side note: you pass tuples with more than one item to the write method in your example, but because your schema only has one column, the rest of the items in the tuple will be thrown away.)

One particular downside of this dynamic casting mechanism is that it depends on the order in which the container types are defined. For example, if you wrote this schema:

>>> fp = open("./new_data-6.orc", "wb")
>>> writer1 = pyorc.Writer(fp, "struct<col1:uniontype<double,int>>")
>>> writer1.write((0,))
>>> writer1.write((1.0,))
>>> writer1.write((22.0,))
>>> writer1.write((33,))
>>> writer1.close()
>>> fp.close()

You wouldn't be able to write anything as an int (tag: 1), because every Python integer can also be converted to a float, so the first container type (double) always matches:

$ orc-contents ./new_data-6.orc
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}
{"col1": {"tag": 0, "value": 22}}
{"col1": {"tag": 0, "value": 33}}
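
The asymmetry behind this ordering effect can be seen in plain Python, without pyorc. The sketch below uses two hypothetical helpers, `accepted_as_float` and `accepted_as_int`, built on the standard `float()` and `operator.index()` functions. Whether pyorc performs exactly these checks is an assumption; the point is only that an int passes both tests while a float passes just one.

```python
import operator


def accepted_as_float(value):
    """True if the value can be converted to a Python float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError, OverflowError):
        return False


def accepted_as_int(value):
    """True if the value can be used as an integer without a lossy conversion."""
    try:
        operator.index(value)  # raises TypeError for floats
        return True
    except TypeError:
        return False


for sample in (0, 1.0, 22.0, 33):
    print(repr(sample), accepted_as_float(sample), accepted_as_int(sample))
```

Because integers pass both checks, putting double first means the int branch is never reached. Putting int first leaves the double branch reachable for explicit float values.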

hinxx commented 1 year ago

Thanks for explaining!