nasa-jpl-memex / sce

Sparkler Crawl Environment - a packaged, dockerized version of http://github.com/USCDataScience/sparkler.git
http://irds.usc.edu/sparkler/
Apache License 2.0

Sparkler Crawl Environment

The Sparkler Crawl Environment aims to provide an efficient, scalable, consistent, and reliable software architecture made up of domain discovery tools that enrich a given domain by expanding the collection of artifacts that define it.

This repository, named sce, provides a command-line utility that builds the Sparkler Crawl Environment as a multi-container Docker application, run with Docker Compose on a single node. As a proof of concept, you can install the environment on a single node with the kickstart.sh bash script, which automatically builds and starts all the software components:

./kickstart.sh [-l /path/to/log]
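For scripted setups, the invocation above could be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name `run_kickstart` and its arguments are hypothetical and not part of sce, and the only option it relies on is the `-l` flag shown in the usage line.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around kickstart.sh; not part of the sce repository.
# Usage: run_kickstart [sce_dir] [log_path]

run_kickstart() {
    local sce_dir="${1:-.}"   # directory containing kickstart.sh (assumed: the repository root)
    local log_path="${2:-}"   # optional log path, forwarded to the -l option

    # Fail early if the script is missing or not executable.
    if [ ! -x "${sce_dir}/kickstart.sh" ]; then
        echo "kickstart.sh not found or not executable in ${sce_dir}" >&2
        return 1
    fi

    # Run from inside the repository directory, in a subshell so the
    # caller's working directory is unchanged. A relative log path is
    # therefore resolved against sce_dir.
    if [ -n "${log_path}" ]; then
        (cd "${sce_dir}" && ./kickstart.sh -l "${log_path}")
    else
        (cd "${sce_dir}" && ./kickstart.sh)
    fi
}

# Example (uncomment to run from a clone of the repository):
# run_kickstart . /tmp/sce-kickstart.log
```

The wrapper adds nothing beyond a path check and argument forwarding; calling `./kickstart.sh` directly, as shown above, is equivalent.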

The Sparkler Crawl Environment is built on top of Sparkler, a web crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects, including Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.