Workaround about data ingestion with computing frameworks
Apache License 2.0

ARLAS-proc

Spark Library to ingest and process geodata timeseries

Overview

ARLAS-proc is a toolbox for transforming raw geodata timeseries into enriched movement fragments and trajectories. It is packaged as a Scala library for Apache Spark developers.

Prerequisites

Building

Running

Build

JAR

docker run --rm \
        -w /opt/work \
        -v $PWD:/opt/work \
        -v $HOME/.m2:/root/.m2 \
        -v $HOME/.ivy2:/root/.ivy2 \
        gisaia/sbt:1.5.5_jdk8 \
        sbt clean publishLocal

You can now add it as a local dependency in your own project:

libraryDependencies += "io.arlas" % "arlas-proc" % "X.Y.Z-SNAPSHOT"
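
Before declaring the local dependency, it can be useful to confirm that the build actually produced a jar. The sketch below assumes sbt's default local repository layout under the mounted `$HOME/.ivy2` directory; `list_local_jars` is a hypothetical helper name, not something shipped with ARLAS-proc.

```shell
# List the jars that publishLocal wrote to the local Ivy repository.
# Assumption: sbt's default layout, <ivy home>/local/<organization>/...
# list_local_jars is a hypothetical helper, not part of ARLAS-proc.
list_local_jars() {
  ivy_home="${1:-$HOME/.ivy2}"
  repo_dir="$ivy_home/local/io.arlas"
  # Nothing published yet: print nothing and succeed.
  [ -d "$repo_dir" ] || return 0
  find "$repo_dir" -type f -name '*.jar' | sort
}

# Usage: list_local_jars            (defaults to $HOME/.ivy2)
#        list_local_jars /other/ivy/home
```

An empty output suggests that the publish step did not run or wrote somewhere other than the mounted directory.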

Publish SNAPSHOT version to Cloudsmith

If you have sufficient permissions to our Cloudsmith repository, you can publish a SNAPSHOT build jar to Cloudsmith.

You need to set up the following environment variables first:

export CLOUDSMITH_USER="your-user"
export CLOUDSMITH_API_KEY="your-api-key"

docker run --rm \
        -w /opt/work \
        -v $PWD:/opt/work \
        -v $HOME/.m2:/root/.m2 \
        -v $HOME/.ivy2:/root/.ivy2 \
        -e CLOUDSMITH_USER=${CLOUDSMITH_USER} \
        -e CLOUDSMITH_API_KEY=${CLOUDSMITH_API_KEY} \
        gisaia/sbt:1.5.5_jdk8 \
        sbt clean publish

You can now add it as a remote dependency in your own project:

resolvers += "gisaia-public" at "https://dl.cloudsmith.io/public/gisaia/public/maven/"
libraryDependencies += "io.arlas" % "arlas-proc" % "X.Y.Z-SNAPSHOT"
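
The publish command above forwards `CLOUDSMITH_USER` and `CLOUDSMITH_API_KEY` into the container with `-e`, so an unset variable would only surface once sbt tries to upload. A minimal guard, assuming a POSIX shell, can check them before the container starts; `require_env` is a hypothetical helper name used only for illustration.

```shell
# Fail early if a required credential is missing or empty.
# require_env is a hypothetical helper, not part of ARLAS-proc.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: require_env CLOUDSMITH_USER CLOUDSMITH_API_KEY && docker run ...
```

The same check applies to the release command, which forwards the same two variables.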

Release

If you have sufficient permissions on the GitHub repository, run:

docker run -ti \
        -w /opt/work \
        -v $PWD:/opt/work \
        -v $HOME/.m2:/root/.m2 \
        -v $HOME/.ivy2:/root/.ivy2 \
        -e CLOUDSMITH_USER=${CLOUDSMITH_USER} \
        -e CLOUDSMITH_API_KEY=${CLOUDSMITH_API_KEY} \
        gisaia/sbt:1.5.5_jdk8 \
        sbt clean release

You will be asked for the version to release and for the next development version.

A jar artifact tagged with the released version will be published to Cloudsmith automatically.

User guide

Add ARLAS-proc dependency

To enable the retrieval of ARLAS-proc via sbt, add our Cloudsmith repository in your build.sbt file.

resolvers += "gisaia-public" at "https://dl.cloudsmith.io/public/gisaia/public/maven/"

Specify ARLAS-proc dependency in the dependencies section of your build.sbt file by adding the following line.

libraryDependencies += "io.arlas" % "arlas-proc" % "X.Y.Z"

Test locally through Jupyter Notebook

Open the link printed in the terminal to launch Jupyter Notebook in a browser: http://127.0.0.1:8888/?token=...

Open demo_notebook.ipynb to run the tutorial notebook.

Test locally through spark-shell

Start an interactive spark-shell session. For example:

Tutorial with boat location data

This tutorial applies a processing pipeline to vessel location records to extract the actual vessel trajectories.

We use a sample of AIS (Automatic Identification System) data provided by the Danish Maritime Authority, in accordance with the conditions for the use of Danish public data.

We process the records emitted by two vessels on the 20th of November 2019.

Paste the following code snippets into the spark-shell (using :paste):

The FlowFragmentMapper step also transforms each listed numeric column by averaging it over the observations of a fragment (e.g. "SOG" -> "arlas_track_sog").

val fragment_data = static_filled_data.process(
  new FlowFragmentMapper(dataModel,
    spark,
    aggregationColumnName = dataModel.idColumn,
    averageNumericColumns = List("SOG", "COG", "Heading"))
)
fragment_data.sort("MMSI", "arlas_timestamp").show()

Running tests

Run test suite

docker run -ti \
        -w /opt/work \
        -v $PWD:/opt/work \
        -v $HOME/.m2:/root/.m2 \
        -v $HOME/.ivy2:/root/.ivy2 \
        -e CLOUDSMITH_USER=${CLOUDSMITH_USER} \
        -e CLOUDSMITH_API_KEY=${CLOUDSMITH_API_KEY} \
        gisaia/sbt:1.5.5_jdk8 \
        sbt clean test

Unit tests relying on external API

External APIs are mocked using Wiremock. Wiremock has 2 benefits:

Capture external API

Download the standalone JAR from https://repo1.maven.org/maven2/com/github/tomakehurst/wiremock-standalone/2.25.1/wiremock-standalone-2.25.1.jar and save it into the src/test/resources/wiremock folder.

Launch the JAR by replacing https://external.api.com with your own API:

java -jar wiremock-standalone-2.25.1.jar --verbose --proxy-all="https://external.api.com" --record-mappings

Then, to record the API responses, send your requests to http://localhost:8080 instead of the real API URL.

For example, to record Nominatim results:

java -jar wiremock-standalone-2.25.1.jar --verbose --proxy-all="http://nominatim.services.arlas.io" --record-mappings
curl "http://localhost:8080/reverse.php?format=json&lat=41.270568&lon=6.6701225&zoom=10"

The results are saved into the resources folder, which the Scala tests use.

Use the mock server from Scala tests
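
To check that a recording session captured something, the stub files can be counted. This is only a sketch: it assumes WireMock was launched from src/test/resources/wiremock with its default root directory, so that recorded stubs are written as JSON files in a `mappings` subfolder; `count_mappings` is a hypothetical helper name.

```shell
# Count the stub mappings recorded by WireMock.
# Assumption: WireMock ran from src/test/resources/wiremock with its default
# root directory, so stubs are JSON files in the "mappings" subfolder.
# count_mappings is a hypothetical helper, not part of ARLAS-proc.
count_mappings() {
  root_dir="${1:-src/test/resources/wiremock}"
  if [ ! -d "$root_dir/mappings" ]; then
    echo 0
    return 0
  fi
  find "$root_dir/mappings" -type f -name '*.json' | wc -l | tr -d ' '
}

# Usage: count_mappings             (run from the project root)
```

A count of 0 after a proxied request usually means the request did not go through http://localhost:8080.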

A test class can extend the trait ArlasMockServer, which automatically starts and stops the mock server.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the Apache License, Version 2.0. See LICENSE.txt for details.

Acknowledgments

This project was initiated and is maintained by Gisaïa.