On which installation method(s) does this occur?


Describe the issue

See the write-up at https://github.com/rapidsai/cuspatial/pull/1407#issuecomment-2234181801.

Since around July 12, 2024, the nyc_taxi_years_correlation.ipynb notebook has been taking several hours to complete (on v24.08, using 24.08 cudf and other RAPIDS nightlies). Prior to that, on the exact same hardware, it completed in under 8 minutes.

I was able to reproduce this interactively, on a machine with 8 V100s and CUDA 12.2.

I strongly suspect that this indicates a performance regression, maybe of the form "some change(s) in cudf cause a cuspatial codepath that could previously execute on the GPU to fall back to the CPU", although I don't have profiling output to provide as evidence.

Minimum reproducible example

From https://github.com/rapidsai/cuspatial/pull/1407#issuecomment-2234181801.

Download the input data.

if [ ! -f "tzones_lonlat.json" ]; then
    curl "https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON" -o tzones_lonlat.json;
else
    echo "tzones_lonlat.json found";
fi
if [ ! -f "taxi2016.csv" ]; then
    curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv;
else
    echo "taxi2016.csv found";
fi

Then, in a Python 3.11 session (with v24.08 of cuspatial and all its RAPIDS dependencies), run the following.

import cuspatial
import geopandas as gpd
import cudf
import numpy as np

taxi2016 = cudf.read_csv("taxi2016.csv")
tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')
taxi_zones = cuspatial.from_geopandas(tzones).geometry
taxi_zone_rings = cuspatial.GeoSeries.from_polygons_xy(

def make_geoseries_from_lonlat(lon, lat):
    lonlat = cudf.DataFrame({"lon": lon, "lat": lat}).interleave_columns()
    return cuspatial.GeoSeries.from_points_xy(lonlat)

pickup2016 = make_geoseries_from_lonlat(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'])
dropoff2016 = make_geoseries_from_lonlat(taxi2016['dropoff_longitude'] , taxi2016['dropoff_latitude'])

pip_iterations = list(np.arange(0, 263, 31))

taxi2016['PULocationID'] = 264
taxi2016['DOLocationID'] = 264

start = pip_iterations[0]
end = pip_iterations[1]

zone = taxi_zone_rings[start:end]

# find all pickups in that zone
pickups = cuspatial.point_in_polygon(pickup2016, zone)
dropoffs = cuspatial.point_in_polygon(dropoff2016, zone)

That one combination of polygons completed successfully, but took 21 minutes. The two point_in_polygon() calls accounted for around 20 of those 21 minutes.
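
For anyone who wants to confirm where the wall-clock time goes, here is a minimal timing sketch. The `timed` helper and the `timings` dict are hypothetical names introduced only for illustration, and the workload is a stand-in so the snippet runs without a GPU; in the reproducer the `with` block would wrap each `cuspatial.point_in_polygon(...)` call instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload (assumption) so the sketch runs anywhere. In the reproducer
# this would be: pickups = cuspatial.point_in_polygon(pickup2016, zone)
with timed("stand-in", timings):
    sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

This only measures elapsed host time, which is what matters here since the slowdown appears to be on the host side.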

And in the notebook, 10 such combinations are processed.


[0, 31, 62, 93, 124, 155, 186, 217, 248, 263]


So conservatively, it might take 3.5 hours for the notebook to finish in my setup. And that's making a LOT of assumptions.
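
The arithmetic behind that estimate, as a rough sketch. It assumes every batch costs the same 21 minutes as the one timed above; `MINUTES_PER_BATCH` and `estimate_hours` are hypothetical names used only for illustration.

```python
# Batch boundaries as printed above.
boundaries = [0, 31, 62, 93, 124, 155, 186, 217, 248, 263]

# Assumption: every batch takes as long as the single batch timed above.
MINUTES_PER_BATCH = 21


def estimate_hours(n_batches, minutes_per_batch=MINUTES_PER_BATCH):
    """Return the projected runtime in hours for n_batches equal-cost batches."""
    return n_batches * minutes_per_batch / 60


# Counting 10 batches, as in the text above.
print(estimate_hours(10))  # 3.5

# Pairing consecutive boundaries as (start, end) gives one batch fewer.
pairs = list(zip(boundaries[:-1], boundaries[1:]))
print(len(pairs), estimate_hours(len(pairs)))  # 9 3.15
```

Either way of counting lands at over 3 hours, against under 8 minutes before the regression.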

Relevant log output


Environment details

Both these environments:

Using cudf (and other RAPIDS dependencies) nightly conda packages as of July 12, 2024.

Other symptoms that led to this were documented in #1406.

That was closed by just skipping the most expensive notebooks, in #1407.

harrism commented 1 month ago

@trxcllnt recently modified point_in_polygon. Could those changes have caused this?

jameslamb commented 1 month ago

Are you referring to #1381?

It could be related, but I don't think it'd be the root cause by itself. Those changes were made 2+ months ago, and as recently as #1404 (2 weeks ago), the conda-notebook-tests CI job here was completing in around 9 minutes (build link).

isVoid commented 1 month ago

Also that PR modified the quadtree PiP algo, but the algo in question here is the non-quadtree version.

harrism commented 1 month ago

I did some profiling using py-spy. This is not a complete profile; I only ran it for about 4.5 minutes using py-spy top -- python test.py (test.py contains the code above).

Collecting samples from 'python test.py' (python v3.10.14)
Total Samples 38284
GIL: 100.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 40.00%  79.00%   159.3s    272.1s   compute_index (numba/misc/dummyarray.py:111)
 18.00%  39.00%   57.19s    112.8s   <genexpr> (numba/misc/dummyarray.py:111)
 21.00%  21.00%   55.60s    55.60s   get_offset (numba/misc/dummyarray.py:83)
  8.00%   8.00%   20.65s    20.65s   iter_contiguous_extent (numba/misc/dummyarray.py:275)
  0.00%   0.00%   17.83s    17.83s   iter_contiguous_extent (numba/misc/dummyarray.py:270)
 10.00%  89.00%   15.99s    166.7s   iter_contiguous_extent (numba/misc/dummyarray.py:274)
  0.00%   0.00%   15.00s    136.3s   iter_contiguous_extent (numba/misc/dummyarray.py:269)
  0.00%   0.00%    8.25s     8.25s   iter_contiguous_extent (numba/misc/dummyarray.py:268)
  2.00%   2.00%    8.06s     8.06s   iter_contiguous_extent (numba/misc/dummyarray.py:273)
  0.00% 100.00%    6.88s    375.5s   __getitem__ (numba/cuda/cudadrv/devicearray.py:630)
  0.00%   0.00%    5.59s    170.6s   __getitem__ (numba/misc/dummyarray.py:239)
  1.00% 100.00%    2.61s    198.0s   _do_getitem (numba/cuda/cudadrv/devicearray.py:642)
  0.00%   0.00%    2.58s    165.0s   reshape (numba/misc/dummyarray.py:351)
  0.00%   0.00%    2.32s     2.33s   read_csv (cudf/io/csv.py:96)
  0.00%   0.00%   0.800s     2.61s   _call_with_frames_removed (<frozen importlib._bootstrap>:241)
  0.00%   0.00%   0.210s    0.210s   point_in_polygon (cuspatial/core/spatial/join.py:82)
  0.00%   0.00%   0.180s    0.180s   _compile_bytecode (<frozen importlib._bootstrap_external>:672)
  0.00%   0.00%   0.150s    0.180s   inner (contextlib.py:79)
  0.00%   0.00%   0.140s    0.140s   append (numba/core/byteflow.py:1743)
  0.00%   0.00%   0.130s    0.130s   __init__ (fiona/collection.py:243)
  0.00%   0.00%   0.130s    0.130s   <listcomp> (shapely/geometry/polygon.py:91)

Nearly all the time is spent in Numba. I also used py-spy to output this SVG, though I only ran it for about a minute. The flame graph gives an idea of where Numba is being called.
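
To put a number on "nearly all", the OwnTime column of the table above can be summed per top-level package. This is a small sketch over the rows copied from that output; `PROFILE` and `own_seconds_by_package` are hypothetical names, and the parsing assumes the column layout shown above rather than any documented py-spy format.

```python
# Rows copied from the py-spy output above:
# %Own, %Total, OwnTime, TotalTime, Function (filename:line)
PROFILE = """\
 40.00%  79.00%   159.3s    272.1s   compute_index (numba/misc/dummyarray.py:111)
 18.00%  39.00%   57.19s    112.8s   <genexpr> (numba/misc/dummyarray.py:111)
 21.00%  21.00%   55.60s    55.60s   get_offset (numba/misc/dummyarray.py:83)
  8.00%   8.00%   20.65s    20.65s   iter_contiguous_extent (numba/misc/dummyarray.py:275)
  0.00%   0.00%   17.83s    17.83s   iter_contiguous_extent (numba/misc/dummyarray.py:270)
 10.00%  89.00%   15.99s    166.7s   iter_contiguous_extent (numba/misc/dummyarray.py:274)
  0.00%   0.00%   15.00s    136.3s   iter_contiguous_extent (numba/misc/dummyarray.py:269)
  0.00%   0.00%    8.25s     8.25s   iter_contiguous_extent (numba/misc/dummyarray.py:268)
  2.00%   2.00%    8.06s     8.06s   iter_contiguous_extent (numba/misc/dummyarray.py:273)
  0.00% 100.00%    6.88s    375.5s   __getitem__ (numba/cuda/cudadrv/devicearray.py:630)
  0.00%   0.00%    5.59s    170.6s   __getitem__ (numba/misc/dummyarray.py:239)
  1.00% 100.00%    2.61s    198.0s   _do_getitem (numba/cuda/cudadrv/devicearray.py:642)
  0.00%   0.00%    2.58s    165.0s   reshape (numba/misc/dummyarray.py:351)
  0.00%   0.00%    2.32s     2.33s   read_csv (cudf/io/csv.py:96)
  0.00%   0.00%   0.800s     2.61s   _call_with_frames_removed (<frozen importlib._bootstrap>:241)
  0.00%   0.00%   0.210s    0.210s   point_in_polygon (cuspatial/core/spatial/join.py:82)
  0.00%   0.00%   0.180s    0.180s   _compile_bytecode (<frozen importlib._bootstrap_external>:672)
  0.00%   0.00%   0.150s    0.180s   inner (contextlib.py:79)
  0.00%   0.00%   0.140s    0.140s   append (numba/core/byteflow.py:1743)
  0.00%   0.00%   0.130s    0.130s   __init__ (fiona/collection.py:243)
  0.00%   0.00%   0.130s    0.130s   <listcomp> (shapely/geometry/polygon.py:91)
"""


def own_seconds_by_package(text):
    """Sum the OwnTime column per top-level package (first path component)."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first four columns are whitespace-separated; the rest is
        # "function (filename:line)", where the filename may contain spaces.
        _own_pct, _total_pct, own, _total, where = line.split(None, 4)
        location = where.split(" (", 1)[1].rstrip(")")
        filename = location.rsplit(":", 1)[0]
        package = filename.split("/", 1)[0]
        totals[package] = totals.get(package, 0.0) + float(own.rstrip("s"))
    return totals


totals = own_seconds_by_package(PROFILE)
numba_share = totals["numba"] / sum(totals.values())
print(f"numba: {totals['numba']:.1f}s own time, {numba_share:.1%} of the rows listed")
```

For the rows listed, Numba frames account for roughly 99% of the own time, while the point_in_polygon frame in cuspatial itself barely registers.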


harrism commented 1 month ago

@mroeschke since you have touched a lot of places in cuSpatial and cuDF recently, can you tell us whether this code is now running in Numba when it didn't use to? That could explain the huge performance regression we are seeing.