holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Experiment with awkward-pandas #1303

Open ianthomas23 opened 8 months ago

ianthomas23 commented 8 months ago

This is a write-up of the status of an experiment in using awkward-pandas to read spatialpandas-like polygon data and render it in Datashader. The related WIP branch is at https://github.com/holoviz/datashader/tree/awkward_geom.

Datashader currently uses SpatialPandas to read geometry data from parquet files, iterate over it quickly using numba, and render it. The data is stored and loaded as ragged arrays, i.e. one contiguous array of point coordinates and one or more arrays of integer offsets that mark the start and end of each geometric primitive in the point coordinate array. SpatialPandas exists to expose this as a Pandas extension array, i.e. with one geometry per row in the DataFrame. It also performs spatial indexing, so that a set of geometries can be spatially subsampled and Datashader only has to walk through the geometries within a bounding box rather than all of them.
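To make the ragged-array layout concrete, here is a minimal pure-Python sketch of one flat coordinate array plus a single level of offsets. The interleaved x/y ordering, the tiny two-polygon dataset, and the helper names `polygon_points` and `polygon_count` are all assumptions made for illustration; the real code works on NumPy arrays inside numba-compiled loops.

```python
# Hypothetical two-polygon dataset stored as a ragged array.
# Flat, contiguous, interleaved x/y coordinates: a triangle then a square.
coords = [
    0.0, 0.0, 1.0, 0.0, 0.0, 1.0,            # polygon 0: 3 points
    2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 2.0, 3.0,  # polygon 1: 4 points
]

# offsets[i] and offsets[i + 1] bound polygon i within coords.
# Offsets here count floats, not points, so each point takes two slots.
offsets = [0, 6, 14]


def polygon_count(offsets):
    """Number of geometries described by an offsets array."""
    return len(offsets) - 1


def polygon_points(coords, offsets, i):
    """Return polygon i as a list of (x, y) tuples."""
    start, stop = offsets[i], offsets[i + 1]
    flat = coords[start:stop]
    # De-interleave x and y values from the flat slice.
    return list(zip(flat[0::2], flat[1::2]))


for i in range(polygon_count(offsets)):
    print(i, polygon_points(coords, offsets, i))
```

Geometry types with more nesting (for example a multipolygon made of several rings) would add further offsets arrays, each indexing into the level below it.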

The data format is perfectly suited to Awkward Array and its related packages Awkward-Pandas and Dask-Awkward. The internals of SpatialPandas could potentially be replaced with an Awkward implementation. The bespoke Pandas extension array code in SpatialPandas could then be removed, as it would be handled by Awkward-Pandas.

There is an example, ak_vs_sp.py, in the root directory of the above-mentioned branch, which reads the NYC buildings (polygon) dataset from a parquet file and renders it using both SpatialPandas and Awkward-Pandas. Timing-wise, the Awkward-Pandas approach is faster than SpatialPandas, although this is mostly because SpatialPandas does some copying of NumPy arrays that could be avoided.

Here is the Awkward-Array output:

[Image: ak_vs_sp1]

If anyone wants to continue with this experiment, here is a possible order of operations:

  1. Bring the branch up to date by rebasing against main.
  2. Check if the outputs are pixel identical for Awkward-Pandas and SpatialPandas.
  3. Use Dask-Awkward to process in parallel on multiple CPUs.
  4. Support all the various geometry types that SpatialPandas supports (polygon, multipolygon, line, multiline, point, multipoint, ring), which are essentially different levels of nested offsets.
  5. Consider how spatial indexing could be incorporated into this, as without it the performance when viewing a subset of the geometry will be poor.
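As a rough illustration of the last step, the sketch below precomputes one bounding box per geometry and keeps only the geometries whose box intersects the viewport. It is a plain linear scan over the boxes, not the spatial index that SpatialPandas uses and not anything from the branch; the dataset and the helper names `geometry_bounds` and `intersecting` are hypothetical.

```python
# Hypothetical ragged polygon data: interleaved x/y coordinates plus offsets.
coords = [
    0.0, 0.0, 1.0, 0.0, 0.0, 1.0,            # polygon 0
    2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 2.0, 3.0,  # polygon 1
]
offsets = [0, 6, 14]


def geometry_bounds(coords, offsets):
    """Return one (xmin, ymin, xmax, ymax) tuple per geometry."""
    bounds = []
    for i in range(len(offsets) - 1):
        flat = coords[offsets[i]:offsets[i + 1]]
        xs, ys = flat[0::2], flat[1::2]
        bounds.append((min(xs), min(ys), max(xs), max(ys)))
    return bounds


def intersecting(bounds, query):
    """Indices of geometries whose bounding box overlaps the query box."""
    qx0, qy0, qx1, qy1 = query
    return [
        i
        for i, (x0, y0, x1, y1) in enumerate(bounds)
        if x0 <= qx1 and x1 >= qx0 and y0 <= qy1 and y1 >= qy0
    ]


bounds = geometry_bounds(coords, offsets)
# Only geometries overlapping the viewport need to be walked and rendered.
visible = intersecting(bounds, (1.5, 1.5, 2.5, 2.5))
print(visible)
```

The boxes only need to be computed once per dataset. A real spatial index would avoid testing every box for each viewport, which is where the performance gain for zoomed-in views comes from.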