
Kids First Portal ETL

The Kids First ETL is an Extract-Transform-Load pipeline for producing data for the Kids First Data Resource Portal. It is built on Scala, Spark, and Elasticsearch.

Dependencies

Before building this application, the following dependency must be built and installed into your local Maven repository (~/.m2).

ES Model

  1. Clone the repository

    git clone git@github.com:kids-first/kf-es-model.git
  2. Install with Maven

    cd kf-es-model
    mvn install

Build

To build the application, run the following from the command line in the root directory of the project

sbt ";clean;assembly"

The assembled application jar can then be found at

${root of the project}/target/scala-2.11/kf-portal-etl.jar
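The ";clean;assembly" string chains two sbt commands: clean, then assembly from the sbt-assembly plugin, which produces the single fat jar named above. As a sketch of the relevant build.sbt settings (illustrative only; the project's actual build file is authoritative, and the exact Scala patch version is an assumption):

    // build.sbt (sketch): sbt-assembly names the fat jar, and the
    // scala-2.11 output directory follows from the Scala version.
    scalaVersion := "2.11.12"  // patch version is an assumption
    assemblyJarName in assembly := "kf-portal-etl.jar"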

Configuration

The Kids First ETL uses lightbend/config as its configuration library. kf_etl.conf defines all of the configuration objects used by the ETL.

The top-level namespace of the configuration is io.kf.etl. To refer to any specific configuration object, prefix its path with this namespace.
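For example, with lightbend/config the prefix appears in every lookup path. A minimal sketch in Scala (the spark.master key below is hypothetical; only the io.kf.etl prefix is prescribed by kf_etl.conf):

    import com.typesafe.config.ConfigFactory

    // Load kf_etl.conf from the classpath and read a value under the
    // io.kf.etl namespace; "spark.master" is an illustrative key only.
    val config = ConfigFactory.load("kf_etl")
    val sparkMaster = config.getString("io.kf.etl.spark.master")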

The ETL reads a Java system property named kf.etl.config (passed with -D, as in the spark-submit examples below) which defines the path to the configuration file.

If kf.etl.config is not provided when the application is submitted to Spark, the ETL searches the root of the classpath for a default file named kf_etl.conf; if that is not found either, the application quits.
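The lookup described above can be sketched in Scala as follows (an illustration of the documented behavior, not the ETL's actual code):

    import com.typesafe.config.{Config, ConfigFactory}
    import java.net.URL

    // Prefer the location named by the kf.etl.config system property;
    // otherwise fall back to kf_etl.conf at the root of the classpath.
    // The ETL itself quits when neither source yields a configuration.
    def loadEtlConfig(): Config =
      sys.props.get("kf.etl.config") match {
        case Some(url) => ConfigFactory.parseURL(new URL(url)).resolve()
        case None      => ConfigFactory.load("kf_etl")
      }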

Running the Application

Running the Kids First ETL has some runtime dependencies; refer to submit.sh.example for details.

To submit the application to Spark, run the following from the command line:

${SPARK_HOME}/bin/spark-submit \
  --master spark://${Spark master node name or IP}:7077 \
  --deploy-mode cluster \
  --class io.kf.etl.ETLMain \
  --driver-java-options "-Dkf.etl.config=${URL string for configuration file}" \
  --conf "spark.executor.extraJavaOptions=-Dkf.etl.config=${URL string for configuration file}" \
  ${path to kf-portal-etl.jar} ${command-line arguments}

Command line arguments

The Kids First ETL supports the command-line argument -study_id id1 id2. When provided, the ETL filters the dataset retrieved from the data service by the given study IDs.

The other supported argument is -release_id rid, which specifies the release ID for the run; the two flags combine as shown in the sketch below.
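Roughly, the arguments can be read as in this Scala sketch (hypothetical parsing code for illustration; it is not the ETL's actual argument parser):

    // Parse "-study_id id1 id2 -release_id rid" into the study IDs
    // and the optional release ID; illustrative only.
    def parseArgs(args: Array[String]): (Seq[String], Option[String]) = {
      val studyIds  = args.dropWhile(_ != "-study_id").drop(1).takeWhile(!_.startsWith("-"))
      val releaseId = args.dropWhile(_ != "-release_id").drop(1).headOption
      (studyIds, releaseId)
    }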

To submit the application with study and release IDs, run the following:

${SPARK_HOME}/bin/spark-submit \
  --master spark://${Spark master node name or IP}:7077 \
  --deploy-mode cluster \
  --class io.kf.etl.ETLMain \
  --driver-java-options "-Dkf.etl.config=${URL string for configuration file}" \
  --conf "spark.executor.extraJavaOptions=-Dkf.etl.config=${URL string for configuration file}" \
  ${path to kf-portal-etl.jar} -study_id id1 id2 -release_id rid