locationtech-labs / geopyspark

GeoTrellis for PySpark

Use Apache Arrow to Standardize Data and Create an Interface Between Python and Scala #649

Open jbouffard opened 6 years ago

jbouffard commented 6 years ago

To send data back and forth between Python and Scala, we use Google's Protocol Buffers to serialize/deserialize data as it goes from one language to the other. However, this causes a lot of overhead, which slows down performance because each piece of data has to be converted to its respective representation in the target language. As a result, certain features available only in Python (scikit-learn, matplotlib, OpenCV-Python, etc.) cannot be fully taken advantage of, as the performance hit needed to make the data usable is too great. In addition, certain TiledRasterLayer methods aren't as performant as they could be because they have to serialize/deserialize the data in order to perform their operation.

One way around this would be to use Apache Arrow to provide an interface between Python and Scala. The main draw of this platform is that it provides a language-agnostic way of storing data in a columnar memory format. This allows Python and Scala to access the same data in memory without copying it, which would eliminate the need for serialization/deserialization.
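To make the difference concrete, here is a minimal sketch using only the Python standard library. It is not Arrow and not GeoPySpark code: `pickle` stands in for a serialize/deserialize round trip, `memoryview` stands in for a zero-copy read of a shared buffer, and the `tile` array is a made-up stand-in for raster cells. Arrow's contribution would be doing the zero-copy part across languages with a standardized layout.

```python
import pickle
from array import array

# Hypothetical tile: a flat buffer of 64-bit floats standing in for raster cells.
tile = array("d", [float(i) for i in range(16)])

# Serialization round trip (stand-in for the Protocol Buffers path):
# the receiver ends up with an independent copy of the data.
copied = pickle.loads(pickle.dumps(tile))

# Zero-copy read (stand-in for the shared-memory idea behind Arrow):
# the view points at the very same memory as `tile`.
view = memoryview(tile)

# Mutate the original buffer to see which reader shares memory with it.
tile[0] = 99.0

copy_sees_change = copied[0] == 99.0  # the copy is detached from `tile`
view_sees_change = view[0] == 99.0    # the view reads `tile`'s memory directly

print(copy_sees_change, view_sees_change)
```

The copy costs time and memory proportional to the size of the data on every crossing, while the view costs the same regardless of size, which is why avoiding the round trip should matter most for large tiles.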

Currently, Apache Arrow is in a state where we can begin testing it in GeoPySpark, and it's something that we should do. However, we probably won't see the full benefits of Arrow until this issue is resolved. Even with that issue, though, I feel that we should still see increases in performance.

jbouffard commented 6 years ago

Here's a gist that I'm using as a scratch-pad to figure out how to incorporate PyArrow into GeoPySpark: https://gist.github.com/jbouffard/1bafad67c3cd13325c6f89accd5bf1b6