Invalid Python codegen when record contains aliased nullable union type

johnstairs commented 8 months ago

Given the following model on commit 42be458d779dc0eb0185b6c5b74278a7d06bfa7b:

X: [null, int, float]

MyRec: !record
  fields:
    a: X

We get the following exception when importing the generated code:

Traceback (most recent call last):
  File "/workspaces/yardl/python/run_sandbox.py", line 5, in <module>
    import sandbox
  File "/workspaces/yardl/python/sandbox/__init__.py", line 21, in <module>
    from .types import (
  File "/workspaces/yardl/python/sandbox/types.py", line 121, in <module>
    get_dtype = _mk_get_dtype()
                ^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/types.py", line 117, in _mk_get_dtype
    dtype_map.setdefault(MyRec, np.dtype([('a', get_dtype(typing.Optional[X]))], align=True))
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 87, in <lambda>
    return lambda t: get_dtype_impl(dtype_map, t)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 60, in get_dtype_impl
    return _get_union_dtype(get_args(t))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 81, in _get_union_dtype
    inner_type = get_dtype_impl(dtype_map, args[0])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 76, in get_dtype_impl
    raise RuntimeError(f"Cannot find dtype for {t}")
RuntimeError: Cannot find dtype for <class 'sandbox.types.X'>

Another problem is that the dtype for [null, int, float] should be np.object_, instead of {"has_value": np.bool, "value": np.object_}

Another problem is that the generated union classes that do not have a named type (e.g. Int32OrString) are not recognized by get_dtype() and throw.

naegelejd commented 8 months ago

Another problem is that the dtype for [null, int, float] should be np.object_, instead of {"has_value": np.bool, "value": np.object_}

I see that's the intended behavior, but why is that?

We currently convert yardl type [null, int, float] into Python type typing.Optional[Int32OrString], so I would expect the dtype to be {"has_value": np.bool, "value": np.object_}

johnstairs commented 7 months ago

I see that's the intended behavior, but why is that?

When the value is a python object pointer, there isn't much to be gained by wrapping it in a has_value/value structured array, since the pointer can just be None. But I realize we do this in other cases, for instance an optional string.

microsoft / yardl

Invalid Python codegen when record contains aliased nullable union type #99