project-lux / data-pipeline

Data pipeline to harvest, transform, reconcile, enrich, and export Linked Art data for LUX (or other systems).
Apache License 2.0

Data Transformation Pipeline Code

Architecture

architecture diagram

Pipeline Components

Future ITS-Owned Components:

Pipeline Components:

External Sources: Implementation Status

Source Fetch Map Reconcile Load IdxLoad
AAT N/A N/A -
DNB - -
FAST - - - -
Geonames - N/A -
LCNAF - - -
LCSH -
TGN - N/A -
ULAN - N/A -
VIAF - - -
Who's on First - N/A -
Wikidata
Japan NL - N/A -

✅ = Seems to work; - = Not started; N/A = Can't be done

Fetching external source dump files

Process:

  1. In the config file, look up `dumpFilePath` and `remoteDumpFile`.
  2. Go to the directory containing `dumpFilePath` and rename the existing dump with a date suffix (e.g. `latest-2022-07`).
  3. Execute `wget <url>`, where `<url>` is the URL from `remoteDumpFile` (it's worth validating the URL by hand in a browser first).
  4. For Wikidata, as it's SO HUGE, instead run `nohup wget --quiet <url> &` to fetch it in the background so we can get on with our lives in the meantime.
  5. Done :)
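
The steps above can be sketched in Python. The config keys `dumpFilePath` and `remoteDumpFile` come from this README; the example paths, URLs, and helper names are hypothetical:

```python
# Sketch of the dump-fetch procedure, assuming the config supplies
# dumpFilePath (local dump location) and remoteDumpFile (download URL).
import datetime
import shlex
from pathlib import Path


def archive_name(dump_path: str, today: datetime.date) -> str:
    """Step 2: date-stamp the current dump, e.g. latest -> latest-2022-07."""
    p = Path(dump_path)
    return str(p.with_name(f"{p.name}-{today:%Y-%m}"))


def wget_command(remote_url: str, background: bool = False) -> str:
    """Steps 3-4: build the wget invocation; use the background form
    (nohup ... &) for huge dumps such as Wikidata."""
    url = shlex.quote(remote_url)
    if background:
        return f"nohup wget --quiet {url} &"
    return f"wget {url}"
```

For example, `archive_name("dumps/aat/latest", datetime.date(2022, 7, 1))` yields `dumps/aat/latest-2022-07`, after which the command from `wget_command` can be run in that directory to fetch a fresh dump.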