microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.
MIT License

Support for Apache Arrow #51

Open MedAnd opened 5 years ago

MedAnd commented 5 years ago

Support for Apache Arrow which is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
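To make the "standardized columnar memory format" concrete, here is a minimal stdlib-only sketch (not the Arrow library itself, and a deliberate simplification of the spec) of how an Arrow-style primitive array is represented: a validity bitmap plus a contiguous fixed-width values buffer.

```python
import struct

def build_int32_array(values):
    """Pack a list of optional ints into (validity_bitmap, values_buffer).

    Mirrors the Arrow columnar layout for a primitive array: one bit per
    slot in the bitmap (LSB bit order), and a dense little-endian int32
    values buffer with a placeholder written for null slots.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    buf = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)          # mark slot i as valid
        buf += struct.pack("<i", v if v is not None else 0)
    return bytes(bitmap), bytes(buf)

def get(bitmap, buf, i):
    """Read element i back out; None if its validity bit is unset."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", buf, i * 4)[0]

bitmap, buf = build_int32_array([7, None, 42])
print([get(bitmap, buf, i) for i in range(3)])      # [7, None, 42]
```

Because every column is just a pair of flat buffers like this, another process or language can interpret the same bytes without any serialization step, which is what makes Arrow's zero-copy IPC possible.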

Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm, making it the de facto standard for columnar in-memory analytics.

Related issues:

apscomp commented 5 years ago

I would like to second this, as Apache Arrow is slowly becoming mainstream.

AlgorithmsAreCool commented 4 years ago

So Arrow is definitely gaining steam, but what would an Arrow integration look like for Trill?

Let's say they convert the internal columnar format to Arrow. Trill is still going to be mutating those structures constantly, since it is an incremental platform. How would an integration work with those internal structures safely, and what would it do with them?

AlgorithmsAreCool commented 4 years ago

A few months on, I can answer my own questions here.

The most obvious benefit is that it opens the door to very high-performance interop with other applications or runtimes. Someone could, for example, write a database plugin that uses Trill operations to compute over data.

Furthermore, bulk Arrow structures could be stored directly on disk and accessed via memory mapping, and systems could use Arrow's standardized readers to import CSV or Parquet files into Arrow structures.
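The memory-mapping idea can be sketched with the stdlib alone. This is a hedged illustration of the access pattern, not the actual Arrow IPC file format: a fixed-width int32 column is written to a hypothetical file, then mapped and scanned through a zero-copy view rather than copied into the heap.

```python
import mmap
import os
import struct
import tempfile

# Write one int32 "column" to disk (native byte order, hypothetical file).
path = os.path.join(tempfile.mkdtemp(), "col.bin")
with open(path, "wb") as f:
    f.write(struct.pack("=4i", 10, 20, 30, 40))

# Map the file and view it as int32s without copying the data.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = memoryview(mm).cast("i")   # zero-copy typed view over mapped bytes
    third = col[2]                   # random access into the mapped column
    total = sum(col)                 # full scan, still no copy of the file
    col.release()
    mm.close()

print(third, total)                  # 30 100
```

A real integration would map Arrow IPC files the same way, which is why putting Trill's batches in a standardized layout would let other tools read them in place.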

Overall, using a standardized memory layout allows Trill to be integrated more easily and more efficiently with a wider array of projects. It is a very enticing benefit.