Compile-time data processing

In systems like Spark and FlumeJava, there is a templated, or generic, data structure representing a distributed collection of objects. In Spark, it's called an RDD; in FlumeJava, a PCollection. There are also operations for transforming one collection into another. C++ code written in this style would look something like this:

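// Load a distributed collection of ints from a file.
RDD<int> xs = CreateRddFromFile("hdfs://numbers.txt");
// Keep only the even numbers.
RDD<int> evens = xs.filter([](const int x) { return x % 2 == 0; });
// Convert every number to its string representation.
RDD<std::string> ss = xs.map([](const int x) { return std::to_string(x); });
int num = ss.count();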

Two things about this code are worth noting. First, it leverages the type system of the programming language in which it's written. For example, if we tried to filter an RDD of ints with a function that took strings, the code wouldn't compile:

RDD<int> xs = CreateRddFromFile("hdfs://numbers.txt");
RDD<int> evens = xs.filter([](const std::string& s) { return s.empty(); }); // Compile error!

Second, to leverage compile-time type checking, the program has to be present at compile time! We cannot, for example, read in a query from the command line and execute it. This is in contrast to traditional relational databases, which can read in SQL queries at runtime and evaluate them.
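To see where the compile-time checking comes from, here is a minimal sketch of what such a collection type could look like. It is a hypothetical, single-machine stand-in (a vector in memory, no partitioning, no lazy evaluation), not how Spark or FlumeJava is implemented, and it assumes C++17. The `items()` accessor and the `CountAsStrings` function are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a distributed collection of T.
template <typename T>
class RDD {
 public:
  explicit RDD(std::vector<T> items) : items_(std::move(items)) {}

  // Keeps the elements for which pred returns true. The static_assert turns
  // a predicate that cannot be called with a const T& into a compile error.
  template <typename Pred>
  RDD<T> filter(Pred pred) const {
    static_assert(std::is_invocable_r_v<bool, Pred, const T&>,
                  "filter needs a predicate callable with a const T&");
    std::vector<T> out;
    for (const T& x : items_) {
      if (pred(x)) out.push_back(x);
    }
    return RDD<T>(std::move(out));
  }

  // Applies f to every element. The element type of the result is deduced
  // from f's return type, so the compiler tracks it for us.
  template <typename F>
  auto map(F f) const {
    using U = std::decay_t<std::invoke_result_t<F, const T&>>;
    std::vector<U> out;
    out.reserve(items_.size());
    for (const T& x : items_) out.push_back(f(x));
    return RDD<U>(std::move(out));
  }

  std::size_t count() const { return items_.size(); }
  const std::vector<T>& items() const { return items_; }

 private:
  std::vector<T> items_;
};

// Mirrors the example above, with an in-memory vector in place of the file.
inline std::size_t CountAsStrings() {
  RDD<int> xs(std::vector<int>{1, 2, 3, 4});
  RDD<int> evens = xs.filter([](const int x) { return x % 2 == 0; });
  RDD<std::string> ss = xs.map([](const int x) { return std::to_string(x); });
  assert(evens.count() == 2);
  return ss.count();
}
```

Every type in this sketch is resolved by the compiler: `map` computes its result type from the lambda, and `filter` rejects a predicate over `std::string` when the collection holds ints. None of this is available to a query that only shows up as a string at runtime.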

Runtime query evaluation

As we just mentioned, relational databases have always been able to read in SQL queries dynamically and execute them. Query writers don't have to write a program in a full-blown programming language and run it through a compiler.

This doesn't mean that a database has any less type safety than a system like Spark. It just means that the database has to design its own type system and enforce it itself, at runtime. In C++, for example, we could tag void pointers with an enum of types:

enum class Type {
    STRING = 0,
    INT = 1,
    BOOL = 2,
};

struct Data {
    Type type;
    void *data;
};
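
To make "enforce it itself" concrete, here is a minimal sketch of a checked accessor over this representation. `AsInt` and `IsEven` are hypothetical helpers invented for this example, and the enum and struct are repeated only so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Repeated from above so the sketch is self-contained.
enum class Type { STRING = 0, INT = 1, BOOL = 2 };

struct Data {
  Type type;
  void *data;
};

// Hypothetical helper: returns the int behind d, or throws if the tag says
// that d holds something else. This check is the runtime analogue of the
// compiler rejecting a mistyped lambda.
inline int AsInt(const Data& d) {
  if (d.type != Type::INT) {
    throw std::runtime_error("type error: expected INT");
  }
  return *static_cast<const int*>(d.data);
}

// A predicate over dynamically typed data. The type check happens on every
// call, because nothing is known about d until the program is running.
inline bool IsEven(const Data& d) { return AsInt(d) % 2 == 0; }
```

Nothing stops a caller from skipping `AsInt` and casting `data` directly, which is part of why this style takes more care than relying on the host language's type checker.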

In a language like OCaml, we could represent values as an algebraic data type:

type data = 
  | String of string
  | Int of int
  | Bool of bool

As another example, here is how Impala represents types.

Designing the type system, writing a type checker, and operating on data in these formats is much more difficult than leveraging the type system of the underlying programming language.
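
As a rough illustration of that extra work, here is a small C++17 sketch that uses `std::variant` as a counterpart to the OCaml type above and filters a column with a predicate chosen by name at runtime. The predicate names `"even"` and `"nonempty"` and the functions `Apply` and `Filter` are assumptions made up for this example; a real query engine would parse SQL and type check the whole query before running it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// A value is exactly one of these alternatives, like the OCaml `data` type.
using Value = std::variant<std::string, int, bool>;

// Evaluates a predicate, named by a string supplied at runtime, on one value.
// The type rules ("even" takes an int, "nonempty" takes a string) are written
// and enforced by hand rather than by the C++ compiler.
inline bool Apply(const std::string& pred, const Value& v) {
  if (pred == "even") {
    if (const int* i = std::get_if<int>(&v)) return *i % 2 == 0;
    throw std::runtime_error("type error: even expects an int");
  }
  if (pred == "nonempty") {
    if (const std::string* s = std::get_if<std::string>(&v)) return !s->empty();
    throw std::runtime_error("type error: nonempty expects a string");
  }
  throw std::runtime_error("unknown predicate: " + pred);
}

// Filters a column of dynamically typed values with a runtime-chosen predicate.
inline std::vector<Value> Filter(const std::string& pred,
                                 const std::vector<Value>& column) {
  std::vector<Value> out;
  for (const Value& v : column) {
    if (Apply(pred, v)) out.push_back(v);
  }
  return out;
}
```

The predicate name could come straight from the command line, which the templated version cannot support, but a mistyped query now fails with an exception while the program runs instead of with a compiler error.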

Which to use?

Frameworks like Spark and FlumeJava leverage the type system of an existing language at the cost of being unable to evaluate queries at runtime. Relational databases can evaluate queries at runtime at the cost of having to implement a type system and type checker by hand. Which should we do?
