Closed hexiaoting closed 4 years ago
We don't have a "group by" operation (Awkward is NumPy-like, rather than Pandas-like), though it wouldn't be a bad idea to add one. First, though, we'd need a concept of equality for arbitrary structures: if you want to group by some part of a data structure whose type is records or lists, rather than numbers or strings, we'd need something to tell us whether two values are the same or not. I think such a definition would be unambiguous, though: records have to have the same fields and the same values in all fields (recursively), and lists have to have the same number of elements and the same value in each element (recursively). Any missing values must match, and indirection has to pass through, so that an IndexedArray on one side can be equal to a non-IndexedArray on the other side as long as the IndexedArray rearranges elements to match those on the other side. It could be hard to turn an equality definition in Python into an efficient search for unique elements (the keys of the group-by), so it might need to be defined in C++. I don't see a good way to vectorize it. If a first pass labeled the values in each field of a record with unique integers, we'd then have to find the unique combinations of those labels. I just see a lot of intermediate arrays if this is vectorized, so many that it could reduce performance rather than improve it.
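To make the equality definition above concrete, here is a minimal pure-Python sketch, not Awkward Array code and not how a real implementation would look. It assumes records are Python dicts, lists are Python lists, and missing values are `None`; the names `structurally_equal`, `freeze`, and `group_by` are hypothetical. It also assumes field order does not matter when comparing records, which the definition above doesn't settle, and it ignores indirection (IndexedArray), which has no counterpart in plain Python objects.

```python
def structurally_equal(a, b):
    """Recursive equality for records (dict), lists (list), missing values (None), and scalars."""
    if a is None or b is None:
        # Missing values must match: None only equals None.
        return a is None and b is None
    if isinstance(a, dict) and isinstance(b, dict):
        # Same fields (order ignored: an assumption) and equal values in every field.
        return a.keys() == b.keys() and all(structurally_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # Same number of elements and equal values element by element.
        return len(a) == len(b) and all(structurally_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, (dict, list)) or isinstance(b, (dict, list)):
        # A record never equals a list, and neither equals a scalar.
        return False
    return a == b


def freeze(value):
    """Turn a nested value into a hashable key that agrees with structurally_equal."""
    if isinstance(value, dict):
        return ("record", tuple((k, freeze(value[k])) for k in sorted(value)))
    if isinstance(value, list):
        return ("list", tuple(freeze(v) for v in value))
    return ("scalar", value)


def group_by(records, key):
    """Group records by a possibly nested key, preserving first-seen order of the groups."""
    groups = {}
    for record in records:
        groups.setdefault(freeze(key(record)), []).append(record)
    return list(groups.values())


records = [
    {"pos": {"x": 1, "y": [1, 2]}, "v": 10},
    {"pos": {"y": [1, 2], "x": 1}, "v": 20},  # same record, fields in another order
    {"pos": {"x": 1, "y": [1, 2, 3]}, "v": 30},
    {"pos": None, "v": 40},
]
groups = group_by(records, lambda r: r["pos"])
values = [[r["v"] for r in g] for g in groups]
print(values)
```

This is a per-element Python loop, so it illustrates the definition but none of the performance concerns: it is exactly the kind of non-vectorized search that would probably have to move to C++ to be fast on large arrays.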
That's not a small project and wouldn't be available right away, so if you're looking for a way to do a group-by on JSON right now, maybe try jq, JSONiq, jmespath, JaQL, JsonPath, json-path, ... There are a lot of languages that seek to be the SQL of JSON. Awkward Array seeks to be the NumPy of JSON.
Out of curiosity, what problem are you trying to solve?
Since this is a bit outside Awkward Array's scope and I haven't heard any follow-up, I'm going to close this issue.
PR #733 adds a low-level primitive that enables "group by" operations. Primarily, it's for many-nested "group by," as most things in Awkward Array are.
From the documentation (not yet online):
I want to get group by results on json data. How can I do it with awkward-array?