scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
819 stars 85 forks source link

Implement __array__ = "sorted_map" for dict-like behavior. #780

Open simonthor opened 3 years ago

simonthor commented 3 years ago

In Python, a key in a dictionary can be any immutable type, e.g. ints and floats. However, it seems like awkward array only supports strings as keys. As an example

ak.Array({1:[1, 2, 3]})

gives the error

RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)

I am guessing that implementing this would require quite a lot of changes to the code, so it might not be feasible but would it be possible to allow other key types than strings, e.g. ints, floats or booleans?

If there are too many problems with implementing this feature, maybe the error message could be changed to something along the lines of

KeyError: keys can only be of type string

when an incorrect key type is passed, as in the example above?

jpivarski commented 3 years ago

This goes against the design of what a record is supposed to be: records are not dicts. Specifically, the set of keys in a dict is an aspect of its value, not its type, but the fields of a record are a part of its type: a record with fields x and y differs as much from a record with fields x, y, and z as bool differs from float. Two dicts with different sets of keys, on the other hand, have the same type, whether we're talking about Python's runtime types or Python code that has been type-checked by mypy. This matters because all items in an array have the same type (though possibly an enumerated union). It would be better to think of Awkward records as a thing like a C struct than a C++ std::map.

So what in Awkward Array is like a std::map? I had to answer that question when reading std::map data from ROOT files in Uproot:

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-stl_containers.root"))["tree"]

>>> tree["map_int32_int16"]
<TBranchElement 'map_int32_int16' at 0x7f4ee84e8280>

>>> tree["map_int32_int16"].typename
'std::map<int32_t, int16_t>'

>>> tree["map_int32_int16"].array()
<Array [[(1, 1)], [(1, 1), ... (4, 4), (5, 5)]] type='5 * [var * (int64, int64),...'>

>>> tree["map_int32_int16"].array().type
5 * [var * (int64, int64), parameters={"__array__": "sorted_map"}]

The primary thing about a mapping (Python dict or C++ std::map) is that it provides a way to quickly find a value corresponding to a unique key. We can represent sets of key-value pairs as sorted lists of 2-tuples in Awkward Array, as though we used std::vector<std::pair<K, V>> in C++. (Fun fact: that's how ROOT serializes std::maps.)

The property of being able to search a mapping quickly is a behavior, and Awkward Array's type system is designed for mere representation, not behavior (just as ROOT I/O's is designed for serialization—hence, they don't care about the difference between a std::map<K, V> and a std::vector<std::pair<K, V>>, either). In Awkward Array, behaviors are overlaid by naming array and record types with parameters and then mixing in class methods from the ak.behavior dict (see documentation). I've set aside the name "sorted_map" to mean "provide a lookup function from sorted keys to values," although it hasn't been implemented. It would be a behavior similar to the one that makes lists of uint8 act like strings:

>>> ak.Array(["This", "is", "an", "array", "of", "strings."]).layout
<ListOffsetArray64>
    <parameters>
        <param key="__array__">"string"</param>
    </parameters>
    <offsets><Index64 i="[0 4 6 8 13 15 23]" offset="0" length="7" at="0x559b66ff90e0"/></offsets>
    <content><NumpyArray format="B" shape="23" data="84 104 105 115 105 ... 105 110 103 115 46" at="0x559b66ff1090">
        <parameters>
            <param key="__array__">"char"</param>
        </parameters>
    </NumpyArray></content>
</ListOffsetArray64>

but for mappings. I could convert this issue into a request to implement "sorted_map", but records will never have field names that are not strings.


For now, though, the way to do that would be to use np.searchsorted to get the closest indexes to a given key, then check to see if the closest is exactly equal, then use those indexes on the values. That technique is used in a few different places in Awkward Array's own implementation, such as this one in ak.unflatten:

https://github.com/scikit-hep/awkward-1.0/blob/5999fbd39793a036802b525f34c1dfb3d43547c2/src/awkward/operations/structure.py#L2052-L2063

The positions are the indexes of inneroffsets where inneroffsets are cloests to outeroffsets (doing side="right" and subtracting 1 ensures that the indexes are all greater than or equal to 0 and less than len(inneroffsets)). Then, if the nearest is also exactly equal—what you want when key-matching—the application inneroffsets[positions] must be equal to outeroffsets everywhere. Think of this inneroffsets as the sorted set of keys in your dict and this outeroffsets as the numbers you're hoping to find in that dict. After verifying that inneroffsets[positions] is equal to outeroffsets everywhere, values[positions] are the matched values (no equivalent in the above example).

simonthor commented 3 years ago

Thank you for the detailed explanation and for providing an alternate solution! Thinking of awkward arrays as structs instead of maps definitely makes it easier to understand.