Open manuzhang opened 7 years ago
Interesting read.
Do you know what the difference is between Arrow and Tachyon (since renamed Alluxio)? My understanding is that Arrow sits one layer up in the stack as the storage engine, and Tachyon is the FS underneath it. Is that right?
Tachyon is an FS and a common interface over different FSs (HDFS, S3, etc.) and media (memory, SSD, HDD, etc.), which makes it easier to manage heterogeneous storage across DCs. Arrow is an in-memory format that lets different systems exchange data without serialization overhead (in that sense it plays a connecting role, like Kafka or a Linux pipe).
In "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?", column-store researcher Daniel Abadi summarized the fundamental differences between main-memory and disk-resident column-stores.
Arrow author Wes McKinney (also the creator of pandas and a Parquet PMC member) was disappointed that the post did not acknowledge the full set of aspects distinguishing Arrow from Parquet and ORC; see "Some comments to Daniel Abadi's blog about Apache Arrow".
Arrow's serialization-free, zero-copy data access is used by PySpark to support vectorized UDFs.
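To make "zero-copy" a bit more concrete, here is a minimal sketch using only Python's standard library. It is not Arrow and does not use any Arrow API; the `demo` function and the sample values are made up for illustration. It just contrasts reading a contiguous column buffer in place with making a serialized copy of it.

```python
from array import array


def demo():
    # A "column" of 64-bit integers in one contiguous buffer, loosely
    # analogous to a columnar in-memory layout (an illustration, not Arrow).
    column = array("q", [10, 20, 30, 40])

    # Zero-copy access: the memoryview points at the column's own memory.
    zero_copy = memoryview(column)

    # Copying access: the bytes are duplicated into a brand-new array,
    # which is what a serialize/deserialize round trip amounts to.
    copied = array("q", zero_copy.tobytes())

    # Mutate the original column after both readers were created.
    column[0] = 99

    # The zero-copy reader sees the change; the copy still holds the old value.
    return zero_copy[0], copied[0]


print(demo())
```

The view reflects the later write because it shares memory with the column, while the copy does not. Roughly speaking, a shared in-memory format aims to let different systems be the "view" side of this picture rather than the "copy" side.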