Apache Arrow - Githubissues

manuzhang commented 7 years ago

In "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?", column-store researcher and expert Daniel Abadi summarized the fundamental differences between main-memory and disk-resident column-stores as follows:

| resident | main-memory | disk |
| --- | --- | --- |
| example | Arrow | Parquet, ORC |
| where to optimize | CPU (vectorized processing) | transfer bandwidth |
| compression requirement | low overhead | high compression ratio |
| difference between random access and sequential access | small | big |

Arrow author Wes McKinney (also pandas creator and Parquet PMC member) was disappointed that the full set of aspects distinguishing Arrow from Parquet and ORC was not acknowledged, as he wrote in "Some comments to Daniel Abadi's blog about Apache Arrow":

What Arrow provides a Parquet or ORC columnar storage user is a standardized in-memory data structure to place decoded data

Its serialization-free, zero-copy data access is used by PySpark to support Vectorized UDFs.
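To make "serialization-free, zero-copy" a bit more concrete, here is a minimal sketch using only the Python standard library. It is an analogy, not Arrow's API: `array` stands in for a contiguous column buffer and `memoryview` for a consumer that reads it in place. The names `prices`, `snapshot`, and `tail` are made up for illustration.

```python
from array import array

# One column of 64-bit integers stored contiguously, the way a columnar
# in-memory format lays out a column (stand-in for an Arrow buffer).
prices = array("q", [10, 20, 30, 40])

# Serializing produces an independent copy of the bytes.
snapshot = prices.tobytes()

# A memoryview exposes the same buffer without copying or re-encoding it.
view = memoryview(prices)

# Slicing a memoryview yields another view onto the same memory.
tail = view[2:]

# The producer updates a value in place.
prices[3] = 99

# The view sees the update because it shares the buffer.
shared = tail.tolist()

# The serialized copy still holds the old values.
restored = array("q")
restored.frombytes(snapshot)
copied = restored.tolist()

print(shared)
print(copied)
```

The point of the sketch is only the contrast: the view costs nothing to create and stays in sync with the producer, while the serialized copy has to be written out and decoded again on the other side.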

guozhangwang commented 7 years ago

Interesting read.

Do you know what the difference is between Arrow and Tachyon (renamed Alluxio)? My understanding is that Arrow sits one layer up in the stack as the storage engine, and Tachyon is the FS underneath it. Is that right?

manuzhang commented 7 years ago

Tachyon is an FS and an interface to different FSs (HDFS, S3, etc.) and media (memory, SSD, HDD, etc.), which makes it easier to manage heterogeneous storage across DCs. Arrow is a memory format that lets different systems talk to each other without overhead (like Kafka, or Linux Pipe).

manuzhang / read-it-now

Apache Arrow #17