Python `get_dtype` does not work for vectors, arrays, and maps

microsoft / yardl

Tooling for streaming instrument data

https://microsoft.github.io/yardl/

MIT License

30 stars 5 forks source link

Python `get_dtype` does not work for vectors, arrays, and maps #110

Closed naegelejd closed 10 months ago

naegelejd commented 10 months ago

Using the current test model, yardl throws a RuntimeError: Cannot find dtype for each of the following true assertions:

import test_model as tm

assert tm.get_dtype(tm.AliasedGenericVector[int]) == np.object_

assert tm.get_dtype(tm.AliasedGenericFixedVector[int]) == np.int32

assert tm.get_dtype(tm.AliasedGenericDynamicArray[int]) == np.object_

assert tm.get_dtype(tm.AliasedGenericFixedArray[int]) == np.int32

assert tm.get_dtype(tm.basic_types.AliasedMap[str, int]) == np.object_

naegelejd commented 10 months ago

Here is a related example for non-generic arrays. Given the following model:

IntFixedArray: int[2, 3]

RecordWithFixedArrays: !record
  fields:
    ints: int[2, 3]

RecordWithNamedFixedArrays: !record
  fields:
    ints: IntFixedArray

get_dtype returns the same dtype for both records:

{'names': ['ints'], 'formats': [('<i4', (2, 3))], 'offsets': [0], 'itemsize': 24, 'aligned': True}

However, if I define a non-fixed array type before the fixed array type, like so

IntArray: int[]
IntFixedArray: int[2, 3]

...

then get_dtype returns a new, incorrect dtype for RecordWithNamedFixedArrays:

{'names': ['ints'], 'formats': [('O', (2, 3))], 'offsets': [0], 'itemsize': 48, 'aligned': True}

This extends to vectors as well.

naegelejd commented 10 months ago

At a high level, the get_dtype function should not "install" dtypes for Vectors or Arrays. The issue is that fixed vs non-fixed versions of these compound types both resolve to the same underlying Python type, but have different dtypes. There is no way to distinguish between fixed/non-fixed at runtime.

If we don't "install" a dtype for Named Vector/Array types however, then the get_dtype function can't really be recursive. We have all the type information we need at codegen time to make it non-recursive (EDIT: For non-generic types, that is).

So, using the example above, the get_dtype code might change from

    dtype_map.setdefault(IntArray, np.dtype(np.object_))
    dtype_map.setdefault(IntFixedArray, np.dtype(np.int32))
    dtype_map.setdefault(RecordWithFixedArrays, np.dtype([('ints', np.dtype(np.int32), (2, 3,))], align=True))
    dtype_map.setdefault(RecordWithNamedFixedArrays, np.dtype([('ints', get_dtype(IntFixedArray), (2, 3,))], align=True))

    dtype_map.setdefault(RecordWithFixedArrays, np.dtype([('ints', np.dtype(np.int32), (2, 3,))], align=True))
    dtype_map.setdefault(RecordWithNamedFixedArrays, np.dtype([('ints', np.dtype(np.int32), (2, 3,))], align=True))