scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

Does awkward want to promote itself yet as a general nested processor? #303

Closed: martindurant closed this issue 4 years ago

martindurant commented 4 years ago

Prompted by this stackoverflow question, to which my internal answer was "that's what awkward is for". Is there yet an example of how you might do the full circuit: a parquet file with a complicated schema, through a numba-jitted function, to fast processing?

jpivarski commented 4 years ago

(Yeah, this is really about documentation, because we need an example for that SO question.)

The missing piece is Parquet reading and writing. It was done through pyarrow in the old version and should be again: ak.to_arrow and ak.from_arrow exist, as do PartitionedArray and VirtualArray for lazily loading Parquet columns and row groups. Then #172 would provide some good tests to verify it (as would the old tests).
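For concreteness, the eager path would be only a few lines once it's wired up; a sketch (the file name is hypothetical):

import awkward as ak          # "awkward1" at the time of this thread
import pyarrow.parquet as pq

table = pq.read_table("bikeroutes.parquet")   # pyarrow reads the Parquet file
bikeroutes = ak.from_arrow(table)             # Arrow -> Awkward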

The Numba JIT-compiling is fairly mature; I've demoed it a number of times, and I'm usually fighting Numba type errors, not Awkward errors/missing features. ArrayBuilder provides an easier way to generate a complex schema'ed output than I've seen elsewhere, though understandably it's not as fast as specialized code. (ArrayBuilder internally has dynamic typing, which is something we'd have to warn against for performance-conscious users.) That would probably be the easiest way to make a Parquet file.
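A minimal sketch of that ArrayBuilder-in-Numba pattern (illustrative only; the function and values here are made up):

import awkward as ak          # "awkward1" at the time of this thread
import numba as nb

@nb.njit
def fill(builder, n):
    # the output type is discovered dynamically as the builder is filled
    for i in range(n):
        builder.begin_list()
        for j in range(i):
            builder.real(j * 1.1)
        builder.end_list()

builder = ak.ArrayBuilder()
fill(builder, 5)
array = builder.snapshot()    # snapshot() is called outside the jitted function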

So maybe the right thing to say is that for this SO question, we're not ready, but Parquet reading/writing is the blocker—maybe in other cases as well.

martindurant commented 4 years ago

Probably a long-hand demo using the current from_arrow and just about any jitted function would be nice; the obvious benchmark comparison being the same column in JSON form, converted to Python dictionaries.

jpivarski commented 4 years ago

There is this: https://awkward-array.org/what-is-awkward.html

though it starts from JSON, rather than Parquet.

martindurant commented 4 years ago

Is there a jitted general-purpose function there? I am thinking of something, anything, that doesn't translate nicely into numpy syntax. From a benchmarking standpoint, to show awkward's usefulness for something like the SO question, it would have to start with a large amount of data in parquet.

jpivarski commented 4 years ago

Ah, I forgot where I did these things. It was in a talk, which I should feed back into the documentation so that it has a Numba example.

https://github.com/jpivarski/2020-07-06-scipy2020/blob/c951aa50bbfa879f011c9f2c343a3d0168be28e0/main.tex#L486-L503

import numpy as np
import numba as nb                  #  50× faster than ak.sum version
@nb.jit                             # 250× faster than the pure Python
def compute_lengths(bikeroutes):
    route_length = np.zeros(len(bikeroutes.features))
    for i in range(len(bikeroutes.features)):
        for path in bikeroutes.features[i].geometry.coordinates:
            first = True
            last_east, last_north = 0.0, 0.0
            for lng_lat in path:
                km_east = lng_lat[0] * 82.7
                km_north = lng_lat[1] * 111.1
                if not first:
                    dx2 = (km_east - last_east)**2
                    dy2 = (km_north - last_north)**2
                    route_length[i] += np.sqrt(dx2 + dy2)
                first = False
                last_east, last_north = km_east, km_north
    return route_length

It would be great to start from Parquet, but that Parquet I/O through Arrow is the missing link.

(Incidentally, this example, chosen for pedagogy, has unimpressive speedup compared to other examples. The Awkward version is "only" 5-8× faster than pure Python but the Numba version with its single-pass over data and no intermediate arrays is 250× faster than pure Python. It happens to be a case where the performance advantage of Numba + Awkward is much more impressive than Awkward by itself—unless I find a performance bug in ak.sum or something. Given only this example, the story becomes "Awkward Array for convenience, Numba + Awkward for speed," though a different example might tell a different story.

Another performance point about this example: the array version of the dataset is 250 MB of RAM, 5.2× smaller than the equivalent Python objects. Coming from Parquet would mean no intermediate Python objects in the full workflow from columnar data file through Numba.)
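For reference, the "Awkward version" mentioned above is roughly the following; a sketch assuming the same field layout as compute_lengths, not the exact code from the talk:

import numpy as np
import awkward as ak

coords = bikeroutes.features.geometry.coordinates   # route -> path -> point -> [lng, lat]
km_east = coords[:, :, :, 0] * 82.7
km_north = coords[:, :, :, 1] * 111.1

# differences between consecutive points along each path
dx = km_east[:, :, 1:] - km_east[:, :, :-1]
dy = km_north[:, :, 1:] - km_north[:, :, :-1]

# sum segment lengths over points, then over paths, to get one length per route
route_length = ak.sum(ak.sum(np.sqrt(dx**2 + dy**2), axis=-1), axis=-1)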

martindurant commented 4 years ago

That is exactly the sort of thing I had in mind! from_arrow does work now, right? So the only thing missing would be to have a decent dataset in parquet format (perhaps the same bike stuff) and have a few lines to get the arrow record batch and convert it to ak.

jpivarski commented 4 years ago

Yes. I wonder if there's a JSON → Parquet converter somewhere. You'd have to assign a schema, but ak.type should help there. I take it you're thinking of a one-off Parquet → Arrow → ak.from_arrow → Numba without lazy column/row-group reading of the Parquet, right? (The bikeroutes file would be too small to justify row-groups, anyway.)
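For the schema part, I mean something as simple as this (assuming bikeroutes is already an Awkward Array):

import awkward as ak

# the inferred Awkward type corresponds closely to an Arrow/Parquet schema
print(ak.type(bikeroutes))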

martindurant commented 4 years ago

No need for anything lazy for this, and if there were multiple row-groups, I would read them separately using Dask.

martindurant commented 4 years ago

I wonder if there's a JSON → Parquet converter somewhere

Spark? :) I don't know how effective https://arrow.apache.org/docs/python/json.html is

martindurant commented 4 years ago

Just to update, pyspark did successfully turn the Bikeroutes into parquet, but arrow can't load the output:

ArrowInvalid: Mix of struct and list types not yet supported

Very disappointed! Arrow also can't read the JSON directly, not sure what it's expecting.

The rendering of the schema in spark looks like this:

root
 |-- _corrupt_record: string (nullable = true)
 |-- geometry: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- BIKEROUTE: string (nullable = true)
 |    |-- F_STREET: string (nullable = true)
 |    |-- STREET: string (nullable = true)
 |    |-- TYPE: string (nullable = true)
 |    |-- T_STREET: string (nullable = true)
 |-- type: string (nullable = true)

but the actual parquet schema, rendered by fastparquet, is the following (because Spark list-type elements can be either [] or NULL):

- spark_schema:
| - _corrupt_record: BYTE_ARRAY, UTF8, OPTIONAL
| - geometry: OPTIONAL
| | - coordinates: LIST, OPTIONAL
| |   - list: REPEATED
| |     - element: LIST, OPTIONAL
| |       - list: REPEATED
| |         - element: LIST, OPTIONAL
| |           - list: REPEATED
| |             - element: DOUBLE, OPTIONAL
|   - type: BYTE_ARRAY, UTF8, OPTIONAL
| - properties: OPTIONAL
| | - BIKEROUTE: BYTE_ARRAY, UTF8, OPTIONAL
| | - F_STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | - STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | - TYPE: BYTE_ARRAY, UTF8, OPTIONAL
|   - T_STREET: BYTE_ARRAY, UTF8, OPTIONAL
  - type: BYTE_ARRAY, UTF8, OPTIONAL

martindurant commented 4 years ago

Oh wait, I think spark is assuming one record per line, rather than a well-formed single JSON object.
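i.e. something like this sketch to make a line-delimited version (file names made up):

import json

# write one feature per line, which is what line-oriented readers expect
with open("Bikeroutes.geojson") as f:
    features = json.load(f)["features"]

with open("bikeroutes.jsonl", "w") as f:
    for feature in features:
        f.write(json.dumps(feature) + "\n")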

martindurant commented 4 years ago

OK, that works fine for spark; arrow still fails with the same error. The schema according to fastparquet is:

- spark_schema:
| - crs: OPTIONAL
| | - properties: OPTIONAL
| |   - name: BYTE_ARRAY, UTF8, OPTIONAL
|   - type: BYTE_ARRAY, UTF8, OPTIONAL
| - features: LIST, OPTIONAL
|   - list: REPEATED
|     - element: OPTIONAL
|     | - geometry: OPTIONAL
|     | | - coordinates: LIST, OPTIONAL
|     | |   - list: REPEATED
|     | |     - element: LIST, OPTIONAL
|     | |       - list: REPEATED
|     | |         - element: LIST, OPTIONAL
|     | |           - list: REPEATED
|     | |             - element: DOUBLE, OPTIONAL
|     |   - type: BYTE_ARRAY, UTF8, OPTIONAL
|     | - properties: OPTIONAL
|     | | - BIKEROUTE: BYTE_ARRAY, UTF8, OPTIONAL
|     | | - F_STREET: BYTE_ARRAY, UTF8, OPTIONAL
|     | | - STREET: BYTE_ARRAY, UTF8, OPTIONAL
|     | | - TYPE: BYTE_ARRAY, UTF8, OPTIONAL
|     |   - T_STREET: BYTE_ARRAY, UTF8, OPTIONAL
|       - type: BYTE_ARRAY, UTF8, OPTIONAL
  - type: BYTE_ARRAY, UTF8, OPTIONAL

jpivarski commented 4 years ago

Maybe remove the structs, so that it's just the ListArray(ListArray(ListArray(ListArray(double)))) representing the coordinates? In the past, I've found that structs are fine for conversions to and from Arrow, but then Arrow ↔ Parquet has only implemented nested lists without any structs.

The majority of the exercise can be done with only the coordinates (see the Awkward example for quickly projecting them out; probably easier than in any other framework).

Or maybe it's the Java/Spark converter that's limited: perhaps

as_arrow = ak.to_arrow(bikeroutes)

and then convert as_arrow to Parquet with pyarrow. The C++ implementation used to have this limitation about structs in Parquet, but maybe they've finished that by now.
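Spelled out, that would be something like the sketch below; whether the write succeeds depends on whether pyarrow's Parquet writer still has the struct limitation:

import awkward as ak
import pyarrow as pa
import pyarrow.parquet as pq

as_arrow = ak.to_arrow(bikeroutes)             # Awkward -> Arrow is fine with structs
table = pa.Table.from_arrays([as_arrow], names=["bikeroutes"])
pq.write_table(table, "bikeroutes.parquet")    # this step is where the limitation would show up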

(The disappointing thing is that changing the nested struct configuration of columnar data is purely metadata transformations; Arrow should be able to do this!)

martindurant commented 4 years ago

Interesting case, I wonder if you can tell me the difference between

bikeroutes = ak.from_json("Bikeroutes.geojson")

and

import json

bikeroutes_json = open("Bikeroutes.geojson").read()
bikeroutes_pyobj = json.loads(bikeroutes_json)
bikeroutes2 = ak.Record(bikeroutes_pyobj)

The former has some outer objects, but bikeroutes[0][0] appears identical to bikeroutes2; however, when used with the compute_lengths function, the former takes 1.95ms versus 170us (on my machine). Obviously, I would prefer the simpler-looking syntax.

jpivarski commented 4 years ago

I think they ought to be the same, but I didn't manage to fix the issue with ak.from_json before I needed to make this demo. In ak.from_json, I simply forgot to consider the case in which the JSON is a single JSON object, rather than a JSON array.

Internally, ak.from_json calls a C++ Content::fromjson method, which uses RapidJSON in SAX mode to fill an ArrayBuilder. Not going through the Python objects would be a time-saver (and probably memory, too).

The ak.Record constructor would call ak.from_json if the argument had been a string. Since it's a Python object, it uses pybind11 to walk over the Python objects, feeding an ArrayBuilder.

ArrayBuilder assumes that what it's getting is some kind of array (not a single record). In other words, for arrays, you don't have to start with begin_list and end with end_list, just adding objects fills it with the assumption that they are array elements. This also means there's one fewer error condition to consider: adding a second object is not an error, but it would be if ArrayBuilder didn't assume that what it's getting is an array. Also, ArrayBuilder is user-visible, and forgetting to call begin_list and end_list outside of the main loop would be a common error, if it had been required.

Thus, if the data are in one big record, that's the special case that has to be handled: first by recognizing that the wrapping structure is a record, then by implicitly calling begin_record and end_record, then by pulling the finished record out of the length-one array. The ak.Record constructor does this because just by using it, the user is communicating the fact that they want a record. ak.from_json should go through the same procedure once it recognizes that the outermost JSON structure is a JSON object, not a JSON array.
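In terms of the public ArrayBuilder API, the special case amounts to this sketch:

import awkward as ak

builder = ak.ArrayBuilder()
builder.begin_record()
builder.field("x")
builder.integer(1)
builder.field("y")
builder.real(2.2)
builder.end_record()

array = builder.snapshot()   # a length-one array containing the record
record = array[0]            # ak.Record pulls the single record back out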

I'll make an issue out of this.

martindurant commented 4 years ago

Thanks for the detail, @jpivarski !