harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0

How to debug index usage crashes? #220

Open laurikoobas opened 6 years ago

laurikoobas commented 6 years ago

My code was successfully running with 350 million points and 300 polygons. Now the number of polygons went up to 450 and it started crashing. I did some tests and it still crashes with 10 points (not 10 million, just 10) and those 450 polygons. It's still fine if I limit the number of polygons to 300 though.

Right now I just disabled the index use, but I'd like to get to the root of the issue. Could the problem be in a weird polygon? The largest polygon we have has 174 points.

During my tests, these were some of the error messages:

WARN BlockManagerMasterEndpoint: No more replicas available for rdd_77_0 !
WARN BlockManagerMasterEndpoint: No more replicas available for rdd_61_0 !
ERROR YarnScheduler: Lost executor 2 on blaah: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
...
java.lang.OutOfMemoryError: Java heap space
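The overhead setting named in the YARN message is a standard Spark-on-YARN config; a minimal sketch of raising it at session creation (the value here is illustrative only, not a recommendation):

```scala
// Sketch: raising the overhead setting named in the YARN error above.
// The 1024 MB value is illustrative only; tune it against your container size.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magellan-points-in-polygons")
  .config("spark.yarn.executor.memoryOverhead", "1024") // MB, Spark 2.x name for this setting
  .getOrCreate()
```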

harsha2010 commented 6 years ago

@laurikoobas how big a cluster are you using, and what is the node configuration? If you can share the polygon dataset it would be easier to debug this... otherwise one thing you can do is collect a heap dump during the execution and send it over.
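For the heap dump, one option is to have the executor JVMs dump automatically on OOM via standard HotSpot flags passed through `spark.executor.extraJavaOptions`; a sketch, with the dump path as a placeholder:

```scala
// Sketch: have executor JVMs write a heap dump when they hit an OutOfMemoryError,
// using standard HotSpot flags. The dump path is a placeholder and must be writable
// on the worker nodes (and ideally somewhere that outlives the YARN container).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor.hprof")
```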

laurikoobas commented 6 years ago

Running it as an AWS Glue job on 40 DPUs. It makes sense that the polygon dataset is the cause of this, but I can't share it. What is it about the polygons, though, that would make using the index a problem?

harsha2010 commented 6 years ago

I'm not familiar with Glue, but I think the amount of memory you need for these polygons might be tipping you over the 5 GB limit you have set for the YARN job... What index precision are you using?

laurikoobas commented 6 years ago

I used just the 30 that's in the example. Do you have guidelines or documentation on what it means and which values make sense for which use cases?
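For reference, a sketch of how that precision typically enters a Magellan spatial join, assuming the README-style DSL (`index`, `within`); the exact imports and the `points`/`polygons` DataFrames are assumptions here, not taken from this thread:

```scala
// Sketch only, assuming the README-style Magellan DSL; exact imports and method
// names may differ across Magellan versions.
import org.apache.spark.sql.magellan.dsl.expressions._ // `within` / `index` DSL (assumption)
import spark.implicits._

magellan.Utils.injectRules(spark) // enable Magellan's spatial-join rules (assumption)

// `points` and `polygons` are placeholder DataFrames with `point` and `polygon` columns.
// 30 is the index precision in bits discussed in the following comments.
val joined = points
  .join(polygons index 30)
  .where($"point" within $"polygon")
```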

harsha2010 commented 6 years ago

You want to pick a precision that can eliminate a large fraction of the polygons. E.g., if your polygons are US states and you pick a precision of, say, 10 or 15, each polygon roughly falls into O(1) grids at that precision.

If you pick precision 30 that still holds true, but we now spend more time computing the grids that overlap with each polygon and more space storing those grids, since there will be a lot more of them. Each time you subdivide you get 4x more grids, so if you pick too fine a precision you will pay for it in storage and time.
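A quick back-of-the-envelope illustration of that growth (illustrative arithmetic only):

```scala
// Sketch: rough growth in the number of grid cells covering a fixed polygon area
// as the precision (in bits) increases. Each extra bit halves the cell in one
// dimension, so every +2 bits means ~4x more cells. Illustrative arithmetic only.
val base = 15
Seq(15, 20, 25, 30, 35).foreach { p =>
  val growth = 1L << (p - base) // 2^(p - base)
  println(s"precision $p bits -> roughly ${growth}x as many covering cells as at $base bits")
}
```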

harsha2010 commented 6 years ago

Precision is nothing but the geohash precision: https://gis.stackexchange.com/questions/115280/what-is-the-precision-of-a-geohash

Instead of characters, we are using the bit size (so to convert to geohash character length, simply divide by 5). E.g., a precision of 35 = a 7-character geohash.
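A small sketch of that conversion and the resulting cell size, using standard geohash math (sizes are approximate):

```scala
// Sketch: convert the bit precision Magellan uses to geohash character length and
// to an approximate cell size in degrees. Standard geohash math; sizes are approximate.
def geohashChars(bits: Int): Int = bits / 5

def cellSizeDegrees(bits: Int): (Double, Double) = {
  val lonBits = (bits + 1) / 2 // longitude gets the extra bit when `bits` is odd
  val latBits = bits / 2
  (360.0 / (1L << lonBits), 180.0 / (1L << latBits)) // (lon span, lat span)
}

val (lonDeg, latDeg) = cellSizeDegrees(30)
println(s"30 bits = ${geohashChars(30)}-char geohash, cell ~ $lonDeg x $latDeg degrees")
println(s"35 bits = ${geohashChars(35)}-char geohash")
```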