Are we supporting only Hadoop clusters?

ihodes commented 9 years ago

Doing so would

1) Reduce the number of people easily able to deploy Cycledash, BUT
2) Simplify our code, especially with
3) Our (inevitable?) move to using Impala
4) Building in greater support for using Spark, and other Hadoop tools

hammer commented 9 years ago

I think we should support more than Hadoop clusters. Hadoop has become quite good at hiding behind standard interfaces (e.g. NFS, ODBC). Let's do what we can to make CycleDash work with these interfaces rather than directly with Hadoop.

jaclynperrone commented 9 years ago

Thanks for adding this one!

ihodes commented 9 years ago

Other issues that have come up and need to be addressed in making this decision:

Serving BAMs from an RDBMS will be hard; Impala would work no problem.
We'd like to run workers on the cluster to fix issues with Celery workers failing on bigger files. If we want to support non-Hadoop-style systems as well, we'd need to write workers that run on those too.
We use Postgres-specific SQL (\copy, some string processing, others) right now; we'd likely need to use Impala-specific SQL to use Impala (and Impala would have a data-loading setup); supporting both would be tedious.

hammerlab / cycledash

Are we supporting only Hadoop clusters? #682