rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms
https://docs.rapids.ai/api/cuspatial/stable/
Apache License 2.0

inconsistent result with cuspatial.point_in_polygon #492

Open hakantekbas opened 2 years ago

hakantekbas commented 2 years ago

Hi;

I encountered a strange result when I ran the notebook below.

https://github.com/rapidsai/cuspatial/blob/branch-22.04/notebooks/nyc_taxi_years_correlation.ipynb

To show the inconsistency, I did the same calculation in two different ways: the first is pickups, the second is pickups2. The last column of pickups (polygon 3) is calculated again as the first column of pickups2. The strange part is that the two calculations give different results: the count of True values in the last column of pickups should equal the count in the first column of pickups2. In each calculation, the count of True values in the last column is nearly the number of rows, so something is obviously wrong here. Can you help me figure it out?

start = 0
end = 4
pickups = cuspatial.point_in_polygon(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'], taxi_zones[0][start:end], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])
pickups2 = cuspatial.point_in_polygon(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'], taxi_zones[0][3:6], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y']) 

pickups

            0      1      2     3
0         False  False  False  True
1         False  False  False  True
2         False  False  False  True
3         False  False  False  True
4         False  False  False  True
...         ...    ...    ...   ...
10906853  False  False  False  True
10906854  False  False  False  True
10906855  False  False  False  True
10906856  False  False  False  True
10906857  False  False  False  True

count of true values in pickups

0         605
1           1
2          32
3    10727122
dtype: uint64

pickups2

              3      4     5
0         False  False  True
1         False  False  True
2         False  False  True
3         False  False  True
4         False  False  True
...         ...    ...   ...
10906853  False  False  True
10906854  False  False  True
10906855  False  False  True
10906856  False  False  True
10906857  False  False  True

count of true values in pickups2

3       34894
4           3
5    10692225
dtype: uint64
hakantekbas commented 2 years ago

I want to give one more detail. This method only accepts 31 polygons at a time. If I create a separate shapefile for every 31 polygons, it runs 10 times faster and the results are correct.

for i in range(len(pip_iterations)-1):
    timeStart = time.time()
    start = pip_iterations[i]
    end = pip_iterations[i+1]

    tzones.iloc[start:end].to_file('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')
    taxi_zones = cuspatial.read_polygon_shapefile('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')
    pickups = cuspatial.point_in_polygon(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'], \
        taxi_zones[0], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])
    for j in pickups.columns:
         taxi2016['atama'].loc[pickups[j]] = j + start
zhangjianting commented 2 years ago

There is an issue in your first program. According to https://docs.rapids.ai/api/cuspatial/stable/api.html, in the API cuspatial.point_in_polygon(test_points_x, test_points_y, poly_offsets, poly_ring_offsets, poly_points_x, poly_points_y), poly_offsets and poly_ring_offsets were designed to begin with 0; there could be side effects if they do not. Your 2nd program happens to satisfy the requirement because each group of polygons is read from a shapefile separately.
I am investigating what might happen when taxi_zones[0][start:end] is used instead of taxi_zones[0] for poly_offsets.
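
For intuition, here is a toy illustration of the layout the API expects; the numbers are made up and are not the actual taxi-zone data:

# Toy offsets (made up): 3 polygons with 1, 2, and 1 rings; rings have 4, 4, 5, and 4 vertices.
poly_offsets      = [0, 1, 3]        # polygon i owns rings poly_offsets[i] .. next value - 1
poly_ring_offsets = [0, 4, 8, 13]    # ring j owns vertices poly_ring_offsets[j] .. next value - 1

# Passing only a slice such as poly_offsets[1:3] == [1, 3] while keeping the full
# poly_ring_offsets breaks the pairing: the kernel still sees all remaining rings,
# so (as described below) the result for the last polygon in the slice goes wrong.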

hakantekbas commented 2 years ago

Thanks. When you say there is an issue in my code, are you pointing at pickups2? I will ask another question about cuspatial.quadtree_point_in_polygon: is this method faster than point_in_polygon, or is quadtree_point_in_polygon already incorporated in point_in_polygon?

zhangjianting commented 2 years ago

@hakantekbas yes. The poly_offsets and poly_ring_offsets arrays were designed to begin with 0. I am trying to reproduce your results; unfortunately I am not a good Python programmer. What are the command(s) you used to generate the count lists?
quadtree_point_in_polygon indexes the points first and then performs a point-in-polygon test based spatial join. There is no limit on the number of polygons in quadtree_point_in_polygon, and the results are (point_offset, polygon_offset) pairs, which is different from the result format of point_in_polygon.
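
For example, if test is the DataFrame returned by cuspatial.quadtree_point_in_polygon (with the point_index and polygon_index columns used later in this thread), per-polygon counts analogous to pickups.sum() can be obtained from the pairs with a value_counts, as in this small sketch:

per_polygon_counts = test['polygon_index'].value_counts().sort_index()  # points falling in each polygon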

hakantekbas commented 2 years ago

Hi @zhangjianting; you can try the following command. But if you look at the pickups dataframe, you should see many Trues in the last column.

pickups.sum()

I will also ask how I can apply quadtree_point_in_polygon to this case, and I also wonder whether this method is faster than cuspatial.point_in_polygon. The results obtained from the following code do not look correct.

min_x = taxi2016.pickup_longitude.min()
max_x = taxi2016.pickup_longitude.max()
min_y = taxi2016.pickup_latitude.min()
max_y = taxi2016.pickup_latitude.max()
min_size = 50
max_depth = 4
scale = max(max_x - min_x, max_y - min_y) // (1 << max_depth)

pip_iterations = list(np.arange(0, 263, 30))
pip_iterations.append(263)
print(pip_iterations)
finalList = []
taxi2016["atama"]=264

for i in range(len(pip_iterations)-1):
    start = pip_iterations[i]
    end = pip_iterations[i+1]
    zoneShape = tzones.iloc[start:end]
    zoneShape.to_file('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')
    taxi_zones = cuspatial.read_polygon_shapefile('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')
    print("start:%s  , end:%s" % (start,end))

    key_to_point, quadtree = cuspatial.quadtree_on_points(
            taxi2016["pickup_longitude"],
            taxi2016["pickup_latitude"],
            min_x,
            max_x,
            min_y,
            max_y,
            scale, max_depth, min_size
        )

    polyBoundingBox = cuspatial.polygon_bounding_boxes(taxi_zones[0],taxi_zones[1],taxi_zones[2]["x"],taxi_zones[2]["y"])
    polyQuadOffset = cuspatial.join_quadtree_and_bounding_boxes(quadtree, polyBoundingBox, min_x, max_x, min_y, max_y, scale, max_depth)

    startTime = time.time()
    test = cuspatial.quadtree_point_in_polygon(polyQuadOffset, quadtree, key_to_point, taxi2016['pickup_longitude'], taxi2016['pickup_latitude'], 
                taxi_zones[0], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])
    taxi2016.iloc[test.point_index,taxi2016.columns.tolist().index("atama")] = test.to_pandas()["polygon_index"] + zoneShape.index.min()
    print(time.time()-startTime)
    print("aaa")
zhangjianting commented 2 years ago

In my previous answer, "poly_offsets and poly_ring_offsets were designed to begin with 0" was not accurate. The API expects the values of poly_offsets to match the values of poly_ring_offsets for all the polygons being tested. This is naturally satisfied when a shapefile is read and its contents are used to populate the two arrays. However, using a subset of poly_offsets with the whole poly_ring_offsets array breaks the matching requirement. It turns out that, when a subset of polygons is used, the result for the last polygon in the subset is incorrect, but the results for all the other polygons should be correct. Splitting the polygon shapefile is a good workaround for now. I am discussing with the cuSpatial team to find a good fix. Thanks for the bug report.

hakantekbas commented 2 years ago

Thanks @zhangjianting for the fast response. I also want to hear your thoughts about cuspatial.quadtree_point_in_polygon. Could you give more details about how to use this method and why my calculation does not work in my case? Or maybe I'm applying it correctly but the calculation is wrong because this bug also affects it; even so, I would expect to get the same results, but the results are very different.

zhangjianting commented 2 years ago

@hakantekbas (min_x, min_y, max_x, max_y) = (-121.93428802490236, 0.0, 0.0, 60.908756256103516) is not a good space in which to apply quadtree indexing with min_size = 50 and max_depth = 4. I would suggest filtering out records that are not in the NYC area (with or without the neighboring NJ/CT areas) and trying min_size = 512 and max_depth = 15.
For the example at https://github.com/zhangjianting/cuspatial_benchmark_nyctaxi/blob/main/python/spatial_join.py, the EPSG 2263 projection was used for both the point and polygon data, but it should also work for lon/lat (EPSG 4326). For a discussion of min_size and max_depth, see http://www.adms-conf.org/2019-camera-ready/zhang_adms19.pdf
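
A minimal sketch of that filtering, with illustrative NYC-area bounds (not taken from this thread) and the suggested parameter values:

# Keep only pickups inside a rough NYC-area window before building the quadtree
# (the bounds below are illustrative, not authoritative).
nyc = taxi2016[
    (taxi2016.pickup_longitude > -74.3) & (taxi2016.pickup_longitude < -73.6) &
    (taxi2016.pickup_latitude  >  40.4) & (taxi2016.pickup_latitude  <  41.0)
]
min_x, max_x = nyc.pickup_longitude.min(), nyc.pickup_longitude.max()
min_y, max_y = nyc.pickup_latitude.min(),  nyc.pickup_latitude.max()
max_depth = 15
min_size = 512
scale = max(max_x - min_x, max_y - min_y) / (1 << max_depth)  # float division for a lon/lat extent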

zhangjianting commented 2 years ago

@hakantekbas point_in_polygon and quadtree_point_in_polygon are completely independent.

hakantekbas commented 2 years ago

@zhangjianting thanks for sharing the Python file on GitHub, and for the explanation.

zhangjianting commented 2 years ago

"If I create seperate shape file for every 31 polygons, it is running 10 times faster and results are correct."

My results show that the majority of the time was spent converting the integer/bitmap vector from the Cython/C++ results to a DataFrame. We might be able to accelerate this at the native CUDA level.

Page 3 of https://github.com/zhangjianting/cuspatial/blob/fea-spatial-join/docs/basic_spatial_trajectory_operators.pdf reports that, for a test of 1.3 million points in 27 (simple) polygons, the GPU runtime is 0.92 milliseconds.

For the performance you got, my guess is that it takes less time to create 10 dataframes with 1 column each than to create 1 dataframe with 10 columns, although I do not have proof yet.

cuspatial.point_in_polygon was designed for small problems (e.g., a few million points and no more than 31 polygons). The DataFrame return type is for convenience, not for performance. If performance is important, you can skip the DataFrame wrapping and use the lower-level API directly (see the example below):

result = cuspatial.core.gis.cpp_point_in_polygon(
    taxi2016['pickup_longitude']._column,
    taxi2016['pickup_latitude']._column,
    as_column(taxi_zones[0], dtype="int32"),
    as_column(taxi_zones[1], dtype="int32"),
    taxi_zones[2]['x']._column,
    taxi_zones[2]['y']._column)
result = result.data_array_view.copy_to_host()

Each bit of result[i] indicates whether points[i] is in the corresponding polygon (first polygon at the highest bit, if I remember correctly).

If the number of points is in the order of tens of millions, quadtree indexing would be beneficial. When you have a chance, please let me know the performance you get.

hakantekbas commented 2 years ago

Hi @zhangjianting I will try the last one at weekend. I will keep you informed.

If I create a polygon shapefile for each group of 31 polygons, it is 10 times faster and the results appear to be correct. In the base case I'm using the code below, as given on the GitHub page. I believe the performance degradation is caused by the bug in the point_in_polygon function.

pickups = cuspatial.point_in_polygon(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'], taxi_zones[0][start:end], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])

Thanks.

hakantekbas commented 2 years ago

Hi @zhangjianting

I received the following error. It seems this method also expects 31 polygons at most.

RuntimeError: cuSpatial failure at: /workspace/.conda-bld/work/cpp/src/spatial/point_in_polygon.cu:202: Number of polygons cannot exceed 31

If I put a limit on the number of polygons, it works, but the returned values do not look correct. Maybe I misinterpreted them. I copied the unique values of result below.

array([        0,         1,         2,         4,         8,        16,
              32,        64,       128,       256,       512,      1024,
            2048,      4096,      8192,     16384,     32768,     65536,
          131072,    262144,    524288,   1048576,   2097152,   4194304,
         8388608,  16777216,  33554432,  67108864, 134217728, 268435456,
       536870912], dtype=int32)
zhangjianting commented 2 years ago

@hakantekbas Can you post the code segment you used? It looks like the non-indexed version of the point-in-polygon code was used. What is the array that you displayed?

hakantekbas commented 2 years ago

Hi @zhangjianting

I displayed np.unique(result). All data files were downloaded from the links shared in the notebook.

Thanks.

import cuspatial
from cuspatial.core.gis import as_column
import pandas as pd
import geopandas as gpd
import cudf
import numpy as np
import time

taxi2016 = cudf.read_csv("/data_ssd/hakan/mobilData/spatialAnalysis/data/taxi2016.csv")

tzones = gpd.GeoDataFrame.from_file('/data_ssd/hakan/mobilData/spatialAnalysis/data/tzones_lonlat.json')
tzones.iloc[0:30].to_file('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')

taxi_zones = cuspatial.read_polygon_shapefile('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')
shapeFileSehir = gpd.read_file("/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp")

taxi_zones = cuspatial.read_polygon_shapefile('/data_ssd/hakan/mobilData/spatialAnalysis/data/cu_taxi_zones.shp')

s = cuspatial.core.gis.cpp_point_in_polygon(taxi2016['pickup_longitude']._column,
                                            taxi2016['pickup_latitude']._column,
                                            as_column(taxi_zones[0], dtype="int32"),
                                            as_column(taxi_zones[1], dtype="int32"),
                                            taxi_zones[2]['x']._column,
                                            taxi_zones[2]['y']._column)
s=s.data_array_view.copy_to_host()
zhangjianting commented 2 years ago

@hakantekbas I see. You called the Cython API directly for efficiency. This is still the regular point-in-polygon test API and has the 31-polygon limit.

As indicated in my previous comment, for your output column vector s (which is "result" in my code), each bit of result[i] indicates whether points[i] is in the corresponding polygon. You can use bin(s[i]).count("1") to count the number of polygons that points[i] falls within (typically just one).

The unique values in your output are actually 1<<(31-j), where j varies from 0 to 31. It may indicate that no point falls within more than one polygon, since the number of unique values is 31; this is expected. The counts of the unique values should be the numbers of points that fall within the 31 polygons. This could be a more efficient way to implement dataframe.sum(), if sums are all you need.
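
A small sketch of summarizing the bitmap column on the host, assuming s is the int32 array copied back with copy_to_host() as above:

import numpy as np

values, counts = np.unique(s, return_counts=True)        # one row per distinct bitmask, with its frequency
polygons_hit = [bin(int(v)).count("1") for v in values]  # how many polygons each bitmask encodes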

Please see https://github.com/rapidsai/cuspatial/blob/90be36d1348adef236ffd14abf1829820a46040b/python/cuspatial/cuspatial/core/gis.py#L249 for how the column vector is converted to a DataFrame. The conversion overhead is justifiable when users want the convenience of a DataFrame (e.g., sum()).

zhangjianting commented 2 years ago

@hakantekbas I repeated your experiment and got the following results. The two results agree with each other.

tzones = gpd.GeoDataFrame.from_file('cu_taxi_zones.shp')
tzones.iloc[0:30].to_file('temp_taxi_zones.shp')
taxi_zones = cuspatial.read_polygon_shapefile('temp_taxi_zones.shp')
s = cuspatial.core.gis.cpp_point_in_polygon(
    taxi2016['pickup_longitude']._column,
    taxi2016['pickup_latitude']._column,
    as_column(taxi_zones[0], dtype="int32"),
    as_column(taxi_zones[1], dtype="int32"),
    taxi_zones[2]['x']._column,
    taxi_zones[2]['y']._column)
s = s.data_array_view.copy_to_host()
t = pd.DataFrame(s).value_counts().sort_index()
pickups = cuspatial.point_in_polygon(taxi2016['pickup_longitude'], taxi2016['pickup_latitude'], taxi_zones[0], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])

The first rows of t (Cython API, result in bitmap format):

1        605
2          1
4         32
8      34894
16         3
32        36
64     22536
128      139
256       41
512    31502
1024    1725

The first rows of the per-column sums of pickups (Python API, result in DataFrame format):

0        605
1          1
2         32
3      34894
4          3
5         36
6      22536
7        139
8         41
9      31502
10      1725

zhangjianting commented 2 years ago

@hakantekbas I have a new explanation of the performance difference you saw before ("If I create separate shapefiles for every 31 polygons, it runs 10 times faster and the results are correct."). Using separate shapefiles performs point-in-polygon tests only for the polygons in each shapefile. However, using taxi_zones[0][start:end] in the cuSpatial API performs the tests for ALL the polygons represented by taxi_zones, regardless of "start" and "end". Not only will the results for the last polygon ("end-1") be incorrect, the runtime will also be much larger when "end"-"start" is small compared to the total number of polygons.
Given the complexity of handling the bitmap result from the Cython API cuspatial.core.gis.cpp_point_in_polygon, it would be preferable to use the Python API cuspatial.point_in_polygon; the overhead may not be significant. Sorry for the confusion. In summary, using something like taxi_zones[0][start:end] brings an unexpected semantic error (although it is syntactically correct); using taxi_zones[0] as a whole set (no subsetting) is required. The issue should be fixed in the documentation and/or API design ASAP.

hakantekbas commented 2 years ago

Is there any timeline to fix this issue? It seems this method is one of the most important methods in the API.

harrism commented 2 years ago

Can one of you (@zhangjianting?) briefly summarize the specific issue? I am having trouble following the long discussion. Is there a reason you can't use cuspatial.quadtree_point_in_polygon?

I feel like we should investigate deprecating the old 31-polygon-limited PiP API if the above is a better replacement.

zhangjianting commented 2 years ago

@harrism The extensive discussions were mostly about the "bug" of cuspatial.point_in_polygon in the Python API. It turns out that the notebook example at https://github.com/rapidsai/cuspatial/blob/branch-22.04/notebooks/nyc_taxi_years_correlation.ipynb used a subset of the poly_offsets array but the complete poly_ring_offsets array, which made the result incorrect for the last polygon in the subset. poly_offsets must match poly_ring_offsets, which is automatically satisfied when the two arrays are read from a shapefile directly.

Neither the Python API nor the underlying C++ API has a logic bug, which is why I put "bug" in quotation marks; it is the incorrect use of the APIs that caused the incorrect result. That being said, we need a way to enforce the matching between the poly_offsets and poly_ring_offsets arrays to give a layer of protection.

A simple solution would be to pass the number of polygons and the number of rings as two additional scalar parameters, instead of inferring them from the lengths of the poly_offsets and poly_ring_offsets arrays (as cuDF columns).

The 2nd simple solution is to use the complete poly_offsets and poly_ring_offsets arrays but allow passing a subset of polygon IDs.

The 3rd solution is to use cuDF nested columns so that users will only see one column parameter that includes all information about poly_offsets/poly_ring_offsets/poly_x/poly_y. But this seems to be against our discussions on making cuSpatial independent of cudf.

Yet another, simpler solution is just to add documentation and make the matching requirement between poly_offsets and poly_ring_offsets clear. No change to our code is needed; users just need to add a single line calling the GeoPandas API to write a subset of polygons as a shapefile and then read it back using the cuSpatial shapefile reader.

There could be additional alternatives. Which one would you prefer?

To answer your question on why not always use cuspatial.quadtree_point_in_polygon: the spatial-join based APIs need to perform quadtree indexing first, which may bring unnecessary overhead and hurt performance when the number of points is not that large. More importantly, quadtree indexing requires setting at least two parameters (max_depth and min_size), which is a little tricky and data dependent, and requires users to understand the underlying algorithm to a certain extent. The old (direct) PIP API does not need any such parameters, which is preferable when the number of polygons is no more than 31.

Actually, if we remove the bit-vector representation of the direct/old PIP API result and return (point, polygon) pairs as cuspatial.quadtree_point_in_polygon does, it can allow an arbitrary number of polygons. The cost roughly doubles in this case, since we would need a first phase to count the number of pairs before allocating memory and writing the outputs in a second phase.

Suggestion?

harrism commented 2 years ago

Let me think about it a bit. ;)

harrism commented 2 years ago

Yet another simpler solution is just to add documentation and make the requirement on matching between poly_offsets and poly_ring_offsets clear. No change to our code is needed. Users just need to add a single line by calling GeoPandas API to write a subset polygons as a shapefile and then read it back using cuSpatial shapefile reader.

I think this is the first step, regardless of what we do to the API.

thomcom commented 2 years ago

I think we can round-trip the above just by putting the selected polygons into a cuGeoDataframe. I'll take a stab at it tomorrow.

zhangjianting commented 2 years ago

@thomcom @harrism It would be great if we could have a Python API like subset_polygon_geodataframe(geo_df, subset_ids or subset_range) that returns the same four arrays as read_polygon_shapefile(shp_file). This would leave the C++ API untouched and could be more efficient than saving a subset of polygons to a shapefile and then reading the saved shapefile back (a round-trip to disk), especially for large polygon datasets.
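
A rough sketch of what such a helper could do; the name, the signature, and the assumption that the offset/coordinate arrays are plain host numpy arrays are all hypothetical, not an existing cuSpatial API:

import numpy as np

def subset_polygons(poly_offsets, ring_offsets, xs, ys, start, end):
    # Hypothetical helper: take polygons [start, end) out of the full arrays and
    # re-base both offset arrays so they start at 0 again, keeping them consistent.
    poly_offsets = np.asarray(poly_offsets)
    ring_offsets = np.asarray(ring_offsets)
    ring_lo = poly_offsets[start]
    ring_hi = poly_offsets[end] if end < len(poly_offsets) else len(ring_offsets)
    pt_lo = ring_offsets[ring_lo]
    pt_hi = ring_offsets[ring_hi] if ring_hi < len(ring_offsets) else len(xs)
    return (poly_offsets[start:end] - ring_lo,
            ring_offsets[ring_lo:ring_hi] - pt_lo,
            np.asarray(xs)[pt_lo:pt_hi],
            np.asarray(ys)[pt_lo:pt_hi])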

harrism commented 2 years ago

Another suggestion, from @isVoid:

https://github.com/rapidsai/cuspatial/pull/497#issuecomment-1082621159

would it be helpful to add an example of input to help demonstrate the use of the API?

harrism commented 2 years ago

@hakantekbas in #497 I improved the documentation of this function in the C++ and Python APIs. We have not yet corrected the notebook or taken further steps to make the API less error-prone, so I'm keeping this issue open.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

isVoid commented 2 years ago

The extensive discussions were mostly about the "bug" of cuspatial.point_in_polygon in the Python API. It turns out that the notebook example at https://github.com/rapidsai/cuspatial/blob/branch-22.04/notebooks/nyc_taxi_years_correlation.ipynb used a subset of the poly_offsets array but the complete poly_ring_offsets array, which made the result incorrect for the last polygon in the subset. poly_offsets must match poly_ring_offsets, which is automatically satisfied when the two arrays are read from a shapefile directly.

If my understanding of the issue is correct, the inconsistency results from the input format being incorrect, and the incorrectness results from the polygon array being mistakenly sliced (subset). Recent discussions spawned two separate (but converging) paths to solve this issue:

  1. While it is generally not favorable to introspect the data to determine the validity of the input (because the input format of GIS data is still unstandardized), we tend towards creating a small "validation" package inside cuSpatial and leaving it up to the user to validate their data before calling the main algorithm.
  2. @thomcom recently refactored the GeoSeries library and implemented native slicing for mixed geometry arrays (which include polygon arrays) in https://github.com/rapidsai/cuspatial/pull/585. I recommend trying out the slicing method there and using it in your workflow.
github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.