go-sif / sif

Sif is a framework for fast, predictable, general-purpose distributed computing in the map/reduce paradigm.
Apache License 2.0
32 stars 3 forks source link

Apache Arrow Support? #9

Open sbinet opened 4 years ago

sbinet commented 4 years ago

hi there,

out of curiosity, I was wondering whether you considered providing support for Apache Arrow. there's a lot of overlap with what it's supposed to (eventually) provide and this project. including "dataframes".

Ghnuberath commented 3 years ago

Sorry, had a baby in the middle there and didn't get to respond to your issue. Thanks so much for an excellent question!

Apache Arrow is something that I actually looked at quite closely when I found out about it, which was somewhat after I had started work on Sif. The short answer is yes, I do think some form of support is a great idea, but the long answer is a little more complicated than that.

In particular, one of Sif's guiding design principles is predictability in memory utilization which is, in part, accomplished through an emphasis on fixed-width data types and an internal DataFrame model fashioned quite specifically around that. Using Arrow's DataFrame format internally would, I think, make it difficult to achieve this design goal to the degree I want to (because it would add features to the Sif DataFrame that I am intentionally not including within the framework). It would also tie Sif's capabilities to the limitations of Arrow going forward, which is not necessarily something I would be in favour of.

The interoperability benefits of Arrow cannot be overstated, though! One of the original ideas for Sif that I haven't actually documented here was the ability to ship Partitions to/from Python. I had planned to accomplish that by writing a Sif Python library which would have a client/server relationship with a running Go-based Sif pipeline - shipping Partitions back and forth for transformation in Pythonland. Today, I would accomplish that by implementing the Arrow Flight protocol and that would save me a ton of work! Sif would continue to use its own DataFrame implementation internally, but would transparently convert to/from Arrow when connecting with another Arrow Flight-capable system.

There's a lot to think about here, though, especially whether or not the Sif DataFrame implementation should have feature parity with Arrow's. I need to do some research to see whether or not Spark and others have committed to that and, if not, how they've handled the discrepancies.

Ghnuberath commented 3 years ago

In terms of timing, there's a few features I want to add to Sif to really "prove it's a good idea" before I'd move on to integrations like Flight. But I'd love to have the conversation and would welcome help thinking about the best approach to including something like that.