harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0

Can magellan handle large shapefiles (1M+ polygons)? #127

Closed: dphayter closed this issue 6 years ago

dphayter commented 6 years ago

Thanks for a great geospatial library :-)

I've been trying to load some large reference shapefiles (1M+ polygons per file), but with no success. The schema is read in OK, but no data is returned by rooftops.show(). I've tried increasing the Spark memory allocation, but with no joy. Any pointers to where the issue may be, or methods I should debug? Is there any way to only read n polygons per file?

```scala
val rooftops = sqlContext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show()
```

Many thanks David

harsha2010 commented 6 years ago

@dphayter can you share the shapefile dataset? Also, how many nodes are you using? What is your cluster configuration like?

dphayter commented 6 years ago

Unfortunately, due to licensing I can't share the shapefile :-( I'm running on a 20-node YARN cluster (total available memory: 2.6 TB) via Zeppelin.

Are there any large open-source shapefiles (of similar size) that we could use instead?

Thanks

harsha2010 commented 6 years ago

@dphayter on average, how many edges do the polygons in each file have? And how big is each file in bytes? I'll see if I can find a similar open-source shapefile.

dphayter commented 6 years ago

Approx. 50% of the polygons have 4 edges and 50% have 6-8 edges. At present I'm just trying to read in one shapefile. The file is 334,134,120 bytes and has an FID count of 1.697 million.

harsha2010 commented 6 years ago

@dphayter thanks! I am looking into this issue now. Basically, right now we read an entire shapefile into memory. I am trying to see if there is a sensible way to split a shapefile so it can be streamed in. Will update the thread in a day or so with a conclusion.

dphayter commented 6 years ago

Any progress/thoughts on how to split the shapefile? I've seen that tools like QGIS have a Python utility, 'ShapefileSplitter', and the R package ShapePattern has a shpsplitter function, but it would be nice to be able to split the shapefile as part of sqlContext.read.format("magellan").load. Thanks

Charmatzis commented 6 years ago

@dphayter Hi, the shapefile as a spatial data type has limitations; for example, it cannot exceed 2GB in size. http://support.esri.com/en/technical-article/000010813

There is a limitation on shapefile component file size, not on the shapefile size. All component files of a shapefile are limited to 2GB each.

Accordingly, the .dbf cannot exceed 2GB, and the .shp cannot exceed 2GB. These are the only component files that are likely to approach the 2GB limit.

So a shapefile is small enough that it should easily fit in a cluster.
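
As a quick sanity check against that limit, here is a minimal sketch; the helper name, base-path convention, and paths are illustrative, not part of magellan or the project below:

```scala
// Illustrative pre-flight check: verify that each shapefile component file
// stays under the 2 GB per-component limit quoted above.
import java.io.File

val TwoGB: Long = 2L * 1024 * 1024 * 1024

def oversizedComponents(basePathWithoutExt: String): Seq[File] =
  Seq(".shp", ".shx", ".dbf")
    .map(ext => new File(basePathWithoutExt + ext))
    .filter(f => f.exists && f.length >= TwoGB)

// e.g. oversizedComponents("/data/rooftops") should return Nil for a valid shapefile
```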

A nice workaround for loading the shapefile into Spark is to use JTS.

(Snippet from a project that I am developing:)


```scala
import java.io.File
import java.net.URL

import scala.collection.mutable

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.shapefile.ShapefileDataStore
import org.geotools.data.simple.SimpleFeatureIterator
import org.opengis.feature.simple.SimpleFeature

// geom[G], attributeMap, Polygon and PolygonFeature below come from GeoTrellis
// (geotrellis.vector and geotrellis.shapefile); these imports are a best guess
// for the project this snippet was taken from.
import com.vividsolutions.jts.{geom => jts}
import geotrellis.shapefile.ShapeFileReader.SimpleFeatureWrapper
import geotrellis.vector.{Polygon, PolygonFeature}

// Extract the features from a shapefile as GeoTools 'SimpleFeatures'
def readSimpleFeatures(path: String): List[SimpleFeature] = {
  val url = s"file://${new File(path).getAbsolutePath}"
  val ds = new ShapefileDataStore(new URL(url))
  val ftItr: SimpleFeatureIterator = ds.getFeatureSource.getFeatures.features

  try {
    val simpleFeatures = mutable.ListBuffer[SimpleFeature]()
    while (ftItr.hasNext) simpleFeatures += ftItr.next()
    simpleFeatures.toList
  } finally {
    ftItr.close()
    ds.dispose()
  }
}

// Keep only the polygon geometries, paired with their attribute maps.
def readPolygonFeatures(path: String): Seq[PolygonFeature[Map[String, Object]]] =
  readSimpleFeatures(path)
    .flatMap { ft => ft.geom[jts.Polygon].map(PolygonFeature(_, ft.attributeMap)) }

// Distribute the shapefile paths and read them in parallel, one whole file per task.
def shpUrlsToRdd(urlArray: Array[String], partitionSize: Int = 10)(implicit sc: SparkContext): RDD[Polygon] = {
  val urlRdd: RDD[String] = sc.parallelize(urlArray, urlArray.length * partitionSize)
  urlRdd.mapPartitions { urls =>
    urls.flatMap { url =>
      readPolygonFeatures(url).map(_.geom)
    }
  }
}
```
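
A hypothetical usage of the snippet above (the paths are made up): note that readSimpleFeatures builds a file:// URL, so every executor must be able to see the files (e.g. on a shared mount), and parallelism is per file rather than per record within a file.

```scala
// Illustrative driver-side usage of shpUrlsToRdd; paths are hypothetical.
implicit val sparkContext: SparkContext = sc // `sc` as provided by Zeppelin / spark-shell
val paths = Array("/mnt/shared/shape/part1.shp", "/mnt/shared/shape/part2.shp")
val polygons: RDD[Polygon] = shpUrlsToRdd(paths)
println(polygons.count())
```
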
harsha2010 commented 6 years ago

@dphayter @Charmatzis I actually had a branch where I did this, but seem to have accidentally deleted it. Talking to GitHub support to see if I can recover it... In any case, this shouldn't take too long to fix and I will have a PR for it shortly. @dphayter it would be great to have you try it out, since I couldn't find any really large shapefiles.

harsha2010 commented 6 years ago

@Charmatzis also, the issue is not so much that a shapefile cannot exceed 2GB, but that the 2GB shapefile is being read by a single core... so it's not really using much parallelism. We can fix it by using the .shx index file to figure out how to split the shapefile so it can be read in parallel.
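
For context, per the ESRI shapefile spec the .shx file is just a 100-byte header followed by fixed 8-byte records, each giving one record's offset and content length in the .shp (measured in 16-bit words, big-endian), so split points can be computed without scanning the .shp itself. Below is a minimal sketch of reading that index; the helper names are illustrative and this is not magellan's actual implementation:

```scala
// Illustrative sketch: read the .shx index to recover the byte offset and
// length of every record in the .shp, which is enough to carve the .shp
// into independently readable splits.
import java.io.{DataInputStream, FileInputStream}

case class ShpRecordLocation(byteOffset: Long, byteLength: Long)

def readShxIndex(shxPath: String): List[ShpRecordLocation] = {
  val in = new DataInputStream(new FileInputStream(shxPath))
  try {
    in.skipBytes(100) // skip the fixed-size shapefile header
    val records = scala.collection.mutable.ListBuffer[ShpRecordLocation]()
    while (in.available() >= 8) {
      val offsetWords = in.readInt() // big-endian, in 16-bit words
      val lengthWords = in.readInt() // record content length, also in 16-bit words
      // +8 bytes covers the per-record header (record number + length) in the .shp
      records += ShpRecordLocation(offsetWords.toLong * 2, lengthWords.toLong * 2 + 8)
    }
    records.toList
  } finally in.close()
}
```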

dphayter commented 6 years ago

Thanks, both. @harsha2010 let me know when you have a PR and I'll test it out.

harsha2010 commented 6 years ago

@dphayter, thanks to the fine people at GitHub support I was able to recover the branch! https://github.com/harsha2010/magellan/pull/146 Can you test this out and let me know if it solves your problem?

dphayter commented 6 years ago

Good news, #146 worked for me!

```scala
val rooftops = sqlContext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show(1000)
```

Note: my .shp (file size 334,134,120 bytes / FID count of 1.697 million) also had a corresponding .shx file.

I'll do more detailed join testing shortly

Many thanks David

harsha2010 commented 6 years ago

Resolved by #146