project-lux / data-pipeline

Data pipeline to harvest, transform, reconcile, enrich, and export Linked Art data for LUX (or other systems)
Apache License 2.0

Data Transformation Pipeline Code

Architecture

[Architecture diagram]

Pipeline Components

Future ITS Owned Components:

Pipeline Components:

External Sources: Implementation Status

Source Fetch Map Name Reconcile Load IdxLoad
AAT N/A -
DNB - -
FAST - - - -
Geonames - -
LCNAF -
LCSH
TGN - N/A -
ULAN N/A -
VIAF - -
Who's on First - N/A -
Wikidata -
Japan NL - N/A -
BNF - N/A -
GBIF - N/A -
ORCID - N/A -
ROR - N/A -
Wikimedia API - N/A -
DNB - N/A -
BNE - N/A -
Nomenclature - - - - -
Getty Museum - - - - -
Homosaurus - - - - -
Nomisma - - - - -
SNAC - - - - -

✅ = Done ; - = Not started ; N/A = Can't/Won't be done

Fetching external source dump files

Process:

  1. In the source's config file, look up dumpFilePath and remoteDumpFile
  2. Go to the directory containing dumpFilePath and rename the existing dump file with a date suffix (e.g. latest-2022-07)
  3. Run wget <url>, where <url> is the URL from remoteDumpFile (it's worth validating the URL by hand in a browser first)
  4. For Wikidata, as its dump is SO HUGE, instead run nohup wget --quiet <url> & to fetch it in the background so we can get on with our lives in the meantime
  5. Done :)
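
The same steps can also be scripted. Below is a minimal Python sketch, assuming a per-source YAML config that holds the dumpFilePath and remoteDumpFile keys mentioned above (the actual config format, file locations, and loading code in this repo may differ); it renames the current dump with a date suffix and then streams the new dump to disk so that very large files such as the Wikidata dump do not need to fit in memory.

```python
# Minimal sketch, not the pipeline's actual fetch code.
# Assumes a YAML config per source containing dumpFilePath and remoteDumpFile;
# adjust the config loading to match how the config is really stored.
import shutil
from datetime import date
from pathlib import Path

import requests  # assumed dependency; plain wget on the command line works just as well
import yaml      # assumed dependency for reading the config file


def refresh_dump(config_file: str) -> None:
    with open(config_file) as fh:
        cfg = yaml.safe_load(fh)

    dump_path = Path(cfg["dumpFilePath"])
    remote_url = cfg["remoteDumpFile"]

    # Step 2: keep the previous dump around, renamed with a year-month suffix
    # (e.g. latest-2022-07).
    if dump_path.exists():
        dated = dump_path.with_name(f"{dump_path.stem}-{date.today():%Y-%m}{dump_path.suffix}")
        shutil.move(str(dump_path), str(dated))

    # Steps 3/4: fetch the new dump, streaming so huge files (e.g. Wikidata)
    # are written to disk chunk by chunk. Run this under nohup or in a screen
    # session if the download will take hours.
    with requests.get(remote_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dump_path, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)


if __name__ == "__main__":
    refresh_dump("config/wikidata.yml")  # hypothetical config path
```

For the very largest dumps it is still simplest to fall back to nohup wget --quiet <url> & as described in step 4.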