manuzhang / read-it-now

Don't read it later; read it now

Apache Arrow #17

Open manuzhang opened 7 years ago

manuzhang commented 7 years ago

In "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?", column-store researcher and expert Daniel Abadi summarizes the fundamental differences between main-memory and disk-resident column-stores as follows:

| resident | main-memory | disk |
| --- | --- | --- |
| example | Arrow | Parquet, ORC |
| where to optimize | CPU (vectorized processing) | transfer bandwidth |
| compression requirement | low overhead | high compression ratio |
| difference between random access and sequential access | small | big |
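
To make the "compression requirement" row concrete, here is a minimal sketch using only the Python standard library. Dictionary encoding stands in for a low-overhead scheme and `zlib` for a high-ratio one. The column contents and the helper names (`value_at`, `unpack_all`) are made up for illustration, and this does not claim to reproduce the exact encodings Arrow, Parquet, or ORC apply.

```python
import zlib
from array import array

# Hypothetical string column with only a few distinct values.
column = ["beijing", "seattle", "berlin"] * 1000

# Low-overhead scheme: dictionary encoding, one small integer code per row.
dictionary = sorted(set(column))
code_of = {value: code for code, value in enumerate(dictionary)}
codes = array("B", (code_of[value] for value in column))  # one byte per row


def value_at(row):
    """Decode a single row with one array index and one list lookup."""
    return dictionary[codes[row]]


# High-ratio scheme: general-purpose compression of the serialized column.
raw = "\n".join(column).encode("utf-8")
packed = zlib.compress(raw, 9)


def unpack_all():
    """Decoding means decompressing the whole buffer before reading any row."""
    return zlib.decompress(packed).decode("utf-8").split("\n")


# Sizes in bytes: serialized column, dictionary codes, zlib output.
print(len(raw), len(codes), len(packed))
```

On this toy column the `zlib` output is the smallest of the three buffers, but every read pays for a full decompression, while the dictionary codes can be indexed directly. That is roughly the trade-off the row describes: cheap decoding matters when the CPU is the bottleneck, and ratio matters when transfer bandwidth is.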

Arrow author Wes McKinney (also the creator of pandas and a Parquet PMC member) replied in "Some comments to Daniel Abadi's blog about Apache Arrow", where he was disappointed that the post does not acknowledge the full set of aspects distinguishing Arrow from Parquet and ORC.

What Arrow provides a Parquet or ORC columnar storage user is a standardized in-memory data structure to place decoded data

Arrow's serialization-free, zero-copy data access is used by PySpark to support vectorized UDFs.
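
As a rough analogy for what "serialization-free, zero-copy" means, the sketch below uses Python's built-in buffer protocol (`array` and `memoryview`) rather than Arrow's own API. The names `prices`, `view`, and `window` are hypothetical, and this is not how PySpark is actually wired up. It only shows a consumer reading a column straight out of the producer's buffer, with no copy and no encode/decode step.

```python
from array import array

# Producer: a column of 64-bit floats stored in one contiguous buffer.
prices = array("d", [1.5, 2.5, 3.0, 4.0])

# Consumer: a memoryview exposes the same bytes without copying them.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: it is a window onto the same buffer.
window = view[1:3]

# A write through the producer is immediately visible through the window,
# which shows that both names refer to the same memory.
prices[1] = 9.0

# The consumer can compute directly over the shared buffer.
total = sum(view)
print(window[0], total, view.nbytes)
```

Arrow applies the same idea across processes and languages by standardizing the buffer layout, so that each system can interpret the bytes in place.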

guozhangwang commented 7 years ago

Interesting read.

Do you know what the difference is between Arrow and Tachyon (since renamed Alluxio)? My understanding is that Arrow sits one layer up in the stack: it is the storage engine, and Tachyon is the FS underneath it. Is that right?

manuzhang commented 7 years ago

Tachyon is a FS and an interface to different FSs (HDFS, S3, etc.) and media (memory, SSD, HDD, etc.), which makes it easier to manage heterogeneous storage across DCs. Arrow is an in-memory format that lets different systems talk to each other without overhead (like Kafka, or a Linux pipe).