sul-dlss / was_robot_suite

Robots for Web Archiving Service accessioning and dissemination

WAS_Robot_Suite

Robot code for accessioning and preservation of Web Archiving Service Seed and Crawl objects.

General Robot Documentation

[Deprecated] Check the Wiki in the robot-master repo.

To run the robots, use the lyber-core infrastructure, which runs bundle exec controller boot to start all robots defined in config/environments/robots_ENV.yml.

Deployment

Various dependencies, including cdxj-indexer (installed via pip3 and poetry), can be found in config/settings.yml and shared_configs (was-robotsxxx branches). To install cdxj-indexer:

$ poetry install

And then to run it:

$ poetry run cdxj-indexer --args --follow --here

Prerequisites

See the Prerequisites section further down, which lists the tools needed for thumbnail image creation.

Workflows

See the consul pages in the Web Archival portal, especially Web Archiving Development Documentation.

wasCrawlPreassembly

Preassembly workflow for web archiving crawl objects (that include WARC or ARC files) to extract and create metadata. It consists of these robots:

wasCrawlDissemination

Dissemination workflow for web archiving crawl objects. It is kicked off by end-accession, the last step of the common-accessioning workflow, which reads from the APO the disseminationWF that is suitable for this object type. It consists of these robots:

wasSeedPreassembly

Preassembly workflow for web archiving seed objects.

It consists of 4 robots:

wasDissemination

Workflow to route web archiving objects to wasCrawlDisseminationWF based on content type. Note that the wasDisseminationWF itself is fired off by the accessionWF by using the administrative.disseminationWorkflow value in the APO. For example, if the APO has the following, it'll fire off wasDisseminationWF:

  "administrative": {
      "disseminationWorkflow": "wasDisseminationWF",

It consists of 1 robot:

Index rollup

There is a scheduled task to roll up the level0.cdxj files into level1 each night, plus additional rollups to level2 and level3, monthly and yearly respectively.
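
A rollup of this kind can be pictured as merging many sorted CDXJ files into one. The sketch below is an illustration only, not the project's scheduled task: the function name rollup_cdxj and the paths passed to it are hypothetical, and it assumes the indexes are kept in byte order (LC_ALL=C), a common convention for CDXJ files.

```shell
# Hypothetical helper: merge every .cdxj file in a source directory
# into a single sorted file. Names and paths are illustrative only.
rollup_cdxj() {
  src_dir=$1    # directory holding the finer-grained files (assumed layout)
  dest_file=$2  # rolled-up file to write
  tmp_file="$dest_file.tmp"

  # Byte-order sort across all inputs; -u drops exact duplicate lines.
  if ! LC_ALL=C sort -u "$src_dir"/*.cdxj > "$tmp_file"; then
    rm -f "$tmp_file"
    return 1
  fi

  # Replace the destination only after the sort succeeded.
  mv "$tmp_file" "$dest_file"
}

# Example call (hypothetical paths):
#   rollup_cdxj /path/to/level0 /path/to/level1.cdxj
```

The sketch does not clear the source files or fold in an existing destination file; how the scheduled task handles those steps is not covered here.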

Prerequisites

For thumbnail image creation

  1. Kakadu Proprietary Software Binaries - for JP2 generation
  2. libvips
  3. Exiftool
  4. Puppeteer
  5. Google Chrome
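
A quick way to confirm that tools like these are installed is to look for their executables on the PATH. The sketch below is a minimal, hypothetical helper (missing_tools is not part of this repository). The executable names in the example call are assumptions and may differ by platform or install method.

```shell
# Hypothetical helper: print each named executable that is NOT on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example call (executable names are assumptions):
#   missing_tools kdu_compress vips exiftool google-chrome
```

An empty result means every named executable was found; anything printed still needs to be installed or added to the PATH.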

Kakadu

Download and install demonstration binaries from Kakadu: http://kakadusoftware.com/downloads/

NOTE: If you have upgraded to El Capitan on OS X, you will need to download and reinstall the latest version of Kakadu because of changes made with SIP (System Integrity Protection). These changes moved the old executable binaries to an inaccessible location.

Libvips

Mac

brew install libvips

Debian/Ubuntu Linux

sudo apt install libvips42

Exiftool

RHEL

Download the latest version from: http://www.sno.phy.queensu.ca/~phil/exiftool

tar -xf Image-ExifTool-#.##.tar.gz
cd Image-ExifTool-#.##
perl Makefile.PL
make test
sudo make install

Puppeteer

yarn install

Reset Process (For QA/Stage)

Steps

  1. Verify there are no jobs on the was-robots at https://robot-console-stage.stanford.edu/busy
  2. Clear collections: rm -rf /web-archiving-stacks/data/collections/*
  3. Clear indexes: rm -rf /web-archiving-stacks/data/indexes/*
  4. Clear seeds: rm -rf /was_unaccessioned_data/seed/*
  5. Clear jobs: rm -rf /was_unaccessioned_data/jobs/*
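
Steps 2 through 5 can be wrapped in a small script so they run the same way every time. The following is a sketch under stated assumptions, not a script shipped with this repository: the function names clear_dir and reset_was_data and the DRY_RUN variable are hypothetical. It defaults to a dry run so that nothing is deleted unless DRY_RUN=0 is set explicitly, and step 1 (checking the robot console for running jobs) remains a manual check.

```shell
# Hypothetical reset helper for QA/Stage, mirroring steps 2-5 above.

# Remove the contents of one directory, keeping the directory itself.
clear_dir() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "skip (not a directory): $dir"
    return 0
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default mode: report only, delete nothing.
    echo "would clear: $dir"
  else
    # ${dir:?} aborts if dir is empty, guarding against "rm -rf /*".
    rm -rf "${dir:?}"/*
    echo "cleared: $dir"
  fi
}

# Clear the four locations listed in the steps above.
reset_was_data() {
  clear_dir /web-archiving-stacks/data/collections
  clear_dir /web-archiving-stacks/data/indexes
  clear_dir /was_unaccessioned_data/seed
  clear_dir /was_unaccessioned_data/jobs
}

# Example calls:
#   reset_was_data             # dry run, prints what would be cleared
#   DRY_RUN=0 reset_was_data   # actually deletes
```

Like the original commands, the glob does not match dotfiles, so hidden files in those directories are left in place.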