The `Dockerfile-spark` container includes both Spark and Hadoop, but (as I understand it) the latter is used only to connect Spark with Elasticsearch via the `loader` container. Thus, we have no way to persist the output from Spark on disk (since in cluster mode, Spark requires HDFS).
By adding steps to `Dockerfile-spark` to configure and launch Hadoop with an external volume, I think we could do the following:
Correction: we are using Spark in standalone mode, which does not require HDFS. It does, however, require that all nodes in the cluster have access to the same storage, which can be NFS, HDFS, or S3. Since we already meet this condition with the `/dataset_loading` volume, I think we would need to configure the `/storage` volume in a similar fashion. Doing so would support the following use cases:
- Use the `loader` to create and store the full extracts at time of ingest.
- Use Spark jobs to create user-defined extracts more efficiently.
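If we are wiring this up through Docker Compose, the shared `/storage` volume might look something like the sketch below, mirroring how `/dataset_loading` is mounted. This is only an illustrative excerpt: the service names, build context, and volume names are assumptions, not taken from the actual project files.

```yaml
# Hypothetical excerpt of a docker-compose.yml.
# Service and volume names are illustrative.
services:
  spark-master:
    build:
      context: .
      dockerfile: Dockerfile-spark
    volumes:
      - dataset_loading:/dataset_loading
      - storage:/storage   # shared output location, mirroring /dataset_loading

  spark-worker:
    build:
      context: .
      dockerfile: Dockerfile-spark
    volumes:
      - dataset_loading:/dataset_loading
      - storage:/storage   # every Spark node must mount the same volume

volumes:
  dataset_loading:
  storage:
```

One caveat: a plain named volume is only shared between containers on the same host. If the cluster ever spans multiple hosts, `storage` would need to be backed by something like NFS (e.g. via a volume driver), consistent with the NFS/HDFS/S3 requirement noted above.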