sul-dlss-deprecated / rialto-etl

ETL tools for RIALTO, Stanford Libraries' research intelligence project
https://library.stanford.edu/projects/rialto
Apache License 2.0

RIALTO-ETL

RIALTO-ETL is a set of ETL tools for RIALTO, Stanford Libraries' research intelligence project.

Dependencies

Usage

Pipeline to harvest organizations from Stanford Profiles API into RIALTO Core

exe/extract call StanfordOrganizations > organizations.json
exe/transform call StanfordOrganizations -i organizations.json > organizations.sparql
exe/load call Sparql -i organizations.sparql

Pipeline to harvest researchers from Stanford Profiles API into RIALTO Core

Notes:

exe/extract call StanfordResearchers > researchers.ndj
exe/transform call StanfordPeople -i researchers.ndj > researchers.sparql
exe/load call Sparql -i researchers.sparql
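Both pipelines above follow the same extract, transform, load shape, so they can be driven from one small script. The following is a minimal sketch, not part of rialto-etl: pipeline_commands and run_pipeline are hypothetical helper names, and the sketch assumes it is run from the repository root so that the exe/ scripts resolve.

```ruby
require 'shellwords'

# Hypothetical helper (not part of rialto-etl): builds the three shell
# commands of an extract -> transform -> load pipeline.
def pipeline_commands(extractor:, transformer:, extract_file:, sparql_file:)
  extracted = extract_file.shellescape
  transformed = sparql_file.shellescape
  [
    "exe/extract call #{extractor.shellescape} > #{extracted}",
    "exe/transform call #{transformer.shellescape} -i #{extracted} > #{transformed}",
    "exe/load call Sparql -i #{transformed}"
  ]
end

# Hypothetical helper: runs each stage in order and stops at the first
# stage that fails, so a later stage never sees a partial file.
def run_pipeline(commands)
  commands.each do |command|
    system(command) or raise "Stage failed: #{command}"
  end
end

# The two pipelines documented above, expressed with the helper.
ORGANIZATIONS_PIPELINE = pipeline_commands(
  extractor: 'StanfordOrganizations',
  transformer: 'StanfordOrganizations',
  extract_file: 'organizations.json',
  sparql_file: 'organizations.sparql'
)

RESEARCHERS_PIPELINE = pipeline_commands(
  extractor: 'StanfordResearchers',
  transformer: 'StanfordPeople',
  extract_file: 'researchers.ndj',
  sparql_file: 'researchers.sparql'
)
```

Calling run_pipeline(ORGANIZATIONS_PIPELINE) would then execute the three organization commands in sequence. Building the command strings separately from running them makes it easy to print them for review first.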

Composite ETL

The composite ETL tools streamline operations by running extracts, transforms, and loads on batches of data. Currently, these tools are available for grants and publications.

Pipeline to harvest grants from Stanford SeRA API into RIALTO Core

Notes:

exe/transform call StanfordPeopleList -i researchers.ndj > researchers.csv
exe/grants load -s 3 -i researchers.csv

Run exe/grants help load to see more of the available CLI options.

Pipeline to harvest publications from Web of Science API into RIALTO Core

Notes:

exe/publications load -d ../rialto-sample-data/publications -o data/transformed_publications

Run exe/publications help load to see more of the available CLI options.

Authentication

If you are using the StanfordResearchers or StanfordOrganizations extractors, you first need to obtain a token for the CAP API and assign it to Settings.cap.api_key. To do so, either set an environment variable named SETTINGS__CAP__API_KEY or add the value to config/settings.local.yml (which is ignored by version control and should never be checked in), like so:

cap:
  api_key: 'foobar'

Similarly, if you are using the SPARQL writer, you need to either set SETTINGS__SPARQL_WRITER__API_KEY or add the following to config/settings.local.yml:

sparql_writer:
  api_key: 'key' # SPARQL Proxy API key

Tokens are stored in shared_configs.
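The two environment variable names above follow one pattern: a SETTINGS prefix, then the nested setting keys in upper case, joined by double underscores. The sketch below illustrates that naming convention only. It is not the config gem's implementation, and settings_path and settings_from_env are hypothetical names introduced for this example.

```ruby
PREFIX = 'SETTINGS'
SEPARATOR = '__'

# Hypothetical helper: turns SETTINGS__CAP__API_KEY into ["cap", "api_key"].
# Returns nil for variables that do not start with the prefix.
def settings_path(env_name)
  parts = env_name.split(SEPARATOR)
  return nil unless parts.first == PREFIX

  parts.drop(1).map(&:downcase)
end

# Hypothetical helper: builds a nested hash from every matching variable
# in the given name => value mapping.
def settings_from_env(env)
  env.each_with_object({}) do |(name, value), settings|
    path = settings_path(name)
    next if path.nil? || path.empty?

    *parents, leaf = path
    # Walk (and create) the intermediate hashes, then set the leaf value.
    node = parents.inject(settings) { |hash, key| hash[key] ||= {} }
    node[leaf] = value
  end
end

sample_env = {
  'SETTINGS__CAP__API_KEY' => 'foobar',
  'SETTINGS__SPARQL_WRITER__API_KEY' => 'key',
  'HOME' => '/tmp' # ignored: no SETTINGS prefix
}

EXAMPLE_SETTINGS = settings_from_env(sample_env)
```

Note that a single underscore, as in SPARQL_WRITER, stays inside one key, while the double underscore marks a new nesting level. Passing ENV.to_h instead of sample_env would apply the same mapping to the real environment.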

Run the extract process

Run exe/extract to run a named extractor and print output to STDOUT:

$ exe/extract call StanfordResearchers
{"count":10,"firstPage":true,"lastPage":false,"page":1,"totalCount":29089,"totalPages":2909,"values":[{"administrativeAppointments":[...
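The sample output above looks like a paginated envelope whose records sit under values. Assuming each page is a single JSON object with the fields shown, and reading the field meanings from their names (an inference, not something documented here), a short script can summarize how far an extract has progressed. page_summary is a hypothetical helper, and the values array is left empty because the sample above is truncated.

```ruby
require 'json'

# Stand-in for one page of extractor output, using the field values from
# the sample above. The "values" array is left empty on purpose.
SAMPLE_PAGE = '{"count":10,"firstPage":true,"lastPage":false,"page":1,' \
              '"totalCount":29089,"totalPages":2909,"values":[]}'

# Hypothetical helper: pulls the paging fields out of one page.
# fetch raises KeyError if a field is missing, which surfaces format
# changes early instead of silently returning nil.
def page_summary(json)
  page = JSON.parse(json)
  {
    page: page.fetch('page'),
    total_pages: page.fetch('totalPages'),
    page_size: page.fetch('count'),
    total_records: page.fetch('totalCount'),
    more_pages: !page.fetch('lastPage')
  }
end

SUMMARY = page_summary(SAMPLE_PAGE)

# Sanity check: the page count should match total records / page size,
# rounded up.
EXPECTED_PAGES = (SUMMARY[:total_records] / SUMMARY[:page_size].to_f).ceil
```

For the sample values, the computed page count agrees with the reported totalPages, which supports reading count as the page size.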

List registered extract processes

Run exe/extract list to print out the list of callable extractors.

Transform

Run exe/transform to run a named transformer, based on Traject, on a named input file and print output to STDOUT:

$ exe/transform call StanfordOrganizationsToVivo -i stanford_organizations.json
{"@id":"http://authorities.stanford.edu/orgs#vice-provost-for-undergraduate-education/stanford-introductory-studies/freshman-and-sophomore-programs","@type":"http://vivoweb.org/ontology/core#Division","rdfs:label":"Freshman and Sophomore Programs","vivo:abbreviation":["FFQH"]}

Run exe/transform list to print out the list of callable transformers.

Load

Run exe/load to run a named loader on a named input file and print output to STDOUT:

$ exe/load call Sparql -i whatever.sparql
...

Configuration

RIALTO-ETL uses the config gem to manage configuration, allowing for flexible variation of configs between environments and hosts. By default, the gem assumes it is running in the 'production' environment and will look for its configurations per the config gem documentation. To explicitly set the environment to test or development, set an environment variable named ENV.
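As a rough illustration of the two ideas in this section, the sketch below resolves the environment name with a 'production' default and then layers settings so that later sources override earlier ones key by key. It is a simplified model, not the config gem's code: etl_environment and deep_merge are hypothetical names, and the exact files the gem reads and their order should be taken from the config gem documentation.

```ruby
# Hypothetical helper: reads the environment name from a variable named
# ENV, falling back to 'production' when it is not set.
def etl_environment(env = ENV)
  env.fetch('ENV', 'production')
end

# Hypothetical helper: merges nested hashes so that a later layer only
# replaces the keys it actually defines.
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Assumed layering for illustration: shared defaults first, then local
# overrides such as those in config/settings.local.yml.
LAYERS = [
  { 'cap' => { 'api_key' => nil }, 'sparql_writer' => { 'api_key' => nil } },
  { 'cap' => { 'api_key' => 'foobar' } }
].freeze

MERGED_SETTINGS = LAYERS.reduce({}) { |merged, layer| deep_merge(merged, layer) }
```

Here the local layer sets only cap.api_key, so sparql_writer.api_key keeps its default. Running with ENV=test in the shell environment would make etl_environment return 'test'.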

Help

$ exe/extract help
Commands:
  extract call NAME       # Call named extractor (`extract list` to see available names)
  extract help [COMMAND]  # Describe subcommands or one specific subcommand
  extract list            # List callable extractors

$ exe/transform help
Commands:
  transform call NAME       # Call named transformer (`transform list` to see available names)
  transform help [COMMAND]  # Describe subcommands or one specific subcommand
  transform list            # List callable transformers

$ exe/load help
Commands:
  load call NAME -i, --input-file=FILENAME  # Call named loader (` list` to see available names)
  load help [COMMAND]                       # Describe available commands or one specific command
  load list                                 # List callable loaders

Documentation

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Sample Data

The sample data we use to work with Rialto::Etl is contained in a private GitHub repository.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/sul-dlss/rialto-etl.