Closed martindurant closed 4 years ago
(Yeah, this is really about documentation, because we need an example for that SO question.)
The missing piece is Parquet reading and writing. It was done through pyarrow in the old version and should be again: ak.to_arrow and ak.from_arrow exist, as do PartitionedArray and VirtualArray for lazily loading Parquet columns and row groups. Then #172 would provide some good tests to verify it (as would the old tests).
The Numba JIT-compiling is fairly mature; I've demoed it a number of times, and I'm usually fighting Numba type errors, not Awkward errors/missing features. ArrayBuilder provides an easier way to generate a complex schema'ed output than I've seen elsewhere, though understandably it's not as fast as specialized code. (ArrayBuilder internally has dynamic typing, which is something we'd have to warn against for performance-conscious users.) That would probably be the easiest way to make a Parquet file.
So maybe the right thing to say is that for this SO question, we're not ready, but Parquet reading/writing is the blocker—maybe in other cases as well.
Probably a long-hand demo using the current from_arrow and just about any jitted function would be nice; the obvious benchmark comparison being the same column in JSON form, converted to Python dictionaries.
There is this: https://awkward-array.org/what-is-awkward.html
though it starts from JSON, rather than Parquet.
Is there a jitted general-purpose function there? I am thinking of something, anything, that doesn't translate nicely into NumPy syntax. From a benchmarking standpoint, to show Awkward's usefulness for something like the SO question, it would have to start with a large amount of data in Parquet.
Ah, I forgot where I did these things. It was in a talk, which I should feed back into the documentation so that it has a Numba example.
import numba as nb
import numpy as np

@nb.jit  # 250× faster than the pure Python version (50× faster than the ak.sum version)
def compute_lengths(bikeroutes):
    route_length = np.zeros(len(bikeroutes.features))
    for i in range(len(bikeroutes.features)):
        for path in bikeroutes.features[i].geometry.coordinates:
            first = True
            last_east, last_north = 0.0, 0.0
            for lng_lat in path:
                km_east = lng_lat[0] * 82.7
                km_north = lng_lat[1] * 111.1
                if not first:
                    dx2 = (km_east - last_east)**2
                    dy2 = (km_north - last_north)**2
                    route_length[i] += np.sqrt(dx2 + dy2)
                first = False
                last_east, last_north = km_east, km_north
    return route_length
It would be great to start from Parquet, but that Parquet I/O through Arrow is the missing link.
(Incidentally, this example, chosen for pedagogy, has unimpressive speedup compared to other examples. The Awkward version is "only" 5-8× faster than pure Python, but the Numba version, with its single pass over the data and no intermediate arrays, is 250× faster than pure Python. It happens to be a case where the performance advantage of Numba + Awkward is much more impressive than Awkward by itself, unless I find a performance bug in ak.sum or something. Given only this example, the story becomes "Awkward Array for convenience, Numba + Awkward for speed," though a different example might tell a different story.
Another performance point about this example: the array version of the dataset is 250 MB of RAM, 5.2× smaller than the equivalent Python objects. Coming from Parquet would mean no intermediate Python objects in the full workflow from columnar data file through Numba.)
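For reference, the pure-Python side of that comparison is essentially the following, a sketch over plain dicts and lists with the same rough km-per-degree constants (the helper name is made up):

```python
import math

def compute_lengths_pyobj(bikeroutes):
    # Same algorithm as the jitted version, but over the Python objects
    # produced by json.loads: every element access is a dict/list lookup
    # and every number is a boxed float.
    route_length = [0.0] * len(bikeroutes["features"])
    for i, feature in enumerate(bikeroutes["features"]):
        for path in feature["geometry"]["coordinates"]:
            first = True
            last_east, last_north = 0.0, 0.0
            for lng_lat in path:
                km_east = lng_lat[0] * 82.7    # rough km per degree of longitude
                km_north = lng_lat[1] * 111.1  # km per degree of latitude
                if not first:
                    route_length[i] += math.sqrt(
                        (km_east - last_east) ** 2 + (km_north - last_north) ** 2
                    )
                first = False
                last_east, last_north = km_east, km_north
    return route_length
```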
That is exactly the sort of thing I had in mind! from_arrow does work now, right? So the only thing missing would be to have a decent dataset in parquet format (perhaps the same bike stuff) and have a few lines to get the arrow record batch and convert it to ak.
Yes. I wonder if there's a JSON → Parquet converter somewhere. You'd have to assign a schema, but ak.type should help there. I take it you're thinking of a one-off Parquet → Arrow → ak.from_arrow → Numba without lazy column/row-group reading of the Parquet, right? (The bikeroutes file would be too small to justify row-groups, anyway.)
No need for anything lazy for this, and if there were multiple row-groups, I would read them separately using Dask.
I wonder if there's a JSON → Parquet converter somewhere
Spark? :) I don't know how effective https://arrow.apache.org/docs/python/json.html is
Just to update, pyspark did successfully turn the Bikeroutes into parquet, but arrow can't load the output:
ArrowInvalid: Mix of struct and list types not yet supported
Very disappointed! Arrow also can't read the JSON directly, not sure what it's expecting.
The rendering of the schema in spark looks like this:
root
|-- _corrupt_record: string (nullable = true)
|-- geometry: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: array (containsNull = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- BIKEROUTE: string (nullable = true)
| |-- F_STREET: string (nullable = true)
| |-- STREET: string (nullable = true)
| |-- TYPE: string (nullable = true)
| |-- T_STREET: string (nullable = true)
|-- type: string (nullable = true)
but the actual Parquet schema, rendered by fastparquet, is the following (because Spark list-type elements can be either [] or NULL):
- spark_schema:
| - _corrupt_record: BYTE_ARRAY, UTF8, OPTIONAL
| - geometry: OPTIONAL
| | - coordinates: LIST, OPTIONAL
| | - list: REPEATED
| | - element: LIST, OPTIONAL
| | - list: REPEATED
| | - element: LIST, OPTIONAL
| | - list: REPEATED
| | - element: DOUBLE, OPTIONAL
| | - type: BYTE_ARRAY, UTF8, OPTIONAL
| - properties: OPTIONAL
| | - BIKEROUTE: BYTE_ARRAY, UTF8, OPTIONAL
| | - F_STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | - STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | - TYPE: BYTE_ARRAY, UTF8, OPTIONAL
| | - T_STREET: BYTE_ARRAY, UTF8, OPTIONAL
| - type: BYTE_ARRAY, UTF8, OPTIONAL
Oh wait, I think Spark is assuming one record per line, rather than a well-formed single JSON object.
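In case it's useful, splitting the single FeatureCollection object into one feature per line, which is what Spark's line-oriented JSON reader expects, is a few lines of stdlib (the helper name is made up):

```python
import json

def geojson_to_jsonlines(geojson_text):
    # One feature per line: Spark's JSON reader assumes newline-delimited
    # records rather than a single well-formed JSON object.
    collection = json.loads(geojson_text)
    return "\n".join(json.dumps(feature) for feature in collection["features"])
```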
OK, that works fine for Spark, but Arrow still fails with the same error. The schema according to fastparquet is:
- spark_schema:
| - crs: OPTIONAL
| | - properties: OPTIONAL
| | | - name: BYTE_ARRAY, UTF8, OPTIONAL
| | - type: BYTE_ARRAY, UTF8, OPTIONAL
| - features: LIST, OPTIONAL
| - list: REPEATED
| - element: OPTIONAL
| | - geometry: OPTIONAL
| | | - coordinates: LIST, OPTIONAL
| | | - list: REPEATED
| | | - element: LIST, OPTIONAL
| | | - list: REPEATED
| | | - element: LIST, OPTIONAL
| | | - list: REPEATED
| | | - element: DOUBLE, OPTIONAL
| | | - type: BYTE_ARRAY, UTF8, OPTIONAL
| | - properties: OPTIONAL
| | | - BIKEROUTE: BYTE_ARRAY, UTF8, OPTIONAL
| | | - F_STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | | - STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | | - TYPE: BYTE_ARRAY, UTF8, OPTIONAL
| | | - T_STREET: BYTE_ARRAY, UTF8, OPTIONAL
| | - type: BYTE_ARRAY, UTF8, OPTIONAL
| - type: BYTE_ARRAY, UTF8, OPTIONAL
Maybe remove the structs, so that it's just the ListArray(ListArray(ListArray(ListArray(double)))) representing the coordinates? In the past, I've found that structs are fine for conversions to and from Arrow, but then Arrow ↔ Parquet has only implemented nested lists without any structs.
The majority of the exercise can be done with only the coordinates (see the Awkward example for quickly projecting them out; probably easier than in any other framework).
Or maybe it's the Java/Spark converter that's limited: perhaps as_arrow = ak.to_arrow(bikeroutes) and then convert as_arrow to Parquet with pyarrow. The C++ implementation used to have this limitation about structs in Parquet, but maybe they've finished that by now.
(The disappointing thing is that changing the nested struct configuration of columnar data is purely metadata transformations; Arrow should be able to do this!)
Interesting case, I wonder if you can tell me the difference between

bikeroutes = ak.from_json("Bikeroutes.geojson")

and

bikeroutes_json = open("Bikeroutes.geojson").read()
bikeroutes_pyobj = json.loads(bikeroutes_json)
bikeroutes2 = ak.Record(bikeroutes_pyobj)

The former has some outer objects, but bikeroutes[0][0] appears identical to bikeroutes2; however, when used with the compute_lengths function, the former takes 1.95 ms versus 170 µs (on my machine). Obviously, I would prefer the simpler-looking syntax.
I think they ought to be the same, but I didn't manage to fix the issue with ak.from_json before I needed to make this demo. In ak.from_json, I simply forgot to consider the case in which the JSON is a single JSON object, rather than a JSON array.

Internally, ak.from_json calls a C++ Content::fromjson method, which uses RapidJSON in SAX mode to fill an ArrayBuilder. Not going through the Python objects would be a time-saver (and probably memory, too).
The ak.Record constructor would call ak.from_json if the argument had been a string. Since it's a Python object, it uses pybind11 to walk over the Python objects, feeding an ArrayBuilder.
ArrayBuilder assumes that what it's getting is some kind of array (not a single record). In other words, for arrays, you don't have to start with begin_list and end with end_list; just adding objects fills it with the assumption that they are array elements. This also means there's one fewer error condition to consider: adding a second object is not an error, but it would be if ArrayBuilder didn't assume that what it's getting is an array. Also, ArrayBuilder is user-visible, and forgetting to call begin_list and end_list outside of the main loop would be a common error, if it had been required.
Thus, if the data are in one big record, that's the special case that has to be handled: first by recognizing that the wrapping structure is a record, then by implicitly calling begin_record and end_record, then by pulling the finished record out of the length-one array. The ak.Record constructor does this because just by using it, the user is communicating the fact that they want a record. ak.from_json should go through the same procedure once it recognizes that the outermost JSON structure is a JSON object, not a JSON array.
I'll make an issue out of this.
Thanks for the detail, @jpivarski !
Prompted by this stackoverflow question, to which my internal answer was "that's what awkward is for". Is there yet an example of how you might do the full circuit from a Parquet file with a complicated schema, through a Numba-jitted function, to fast processing?