thedatahub / Datahub-Factory

Datahub::Factory - Transport metadata between Collection Management Systems and the Datahub

Generic SQLite database generation #5

Closed netsensei closed 7 years ago

netsensei commented 7 years ago

The code contains logic to create SQLite databases from CSV files. These databases are used by Catmandu's lookup_in_store() fix to add extra data to records.
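To make the idea concrete, here is a minimal sketch of turning a CSV file into an SQLite lookup table. This is not the project's actual (Perl) implementation; the function name, table name, and column handling are assumptions for illustration only.

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path, table="lookup"):
    """Load a CSV file into an SQLite table, one column per CSV header field.

    Hypothetical helper -- the real tool's logic lives in Perl and is
    currently hardwired to specific museum file names.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # first row supplies the column names
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
        con.commit()
        con.close()
```

A fix like lookup_in_store() can then query the resulting table by key to enrich records.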

The logic currently in the tool is hardwired to a specific museum use case: it contains hardcoded file names, among other things.

There are several things we need to consider here:

  1. Can we make the logic generic, i.e. build it once and reuse it for multiple institutions and for any case where a lookup_in_store requires input?
  2. Should this be a separate command that should be run up front?
  3. Maybe this could be part of a "preprocess" step which happens in a separate, isolated, loosely coupled part of the logic before we start processing the records themselves?
  4. Hardcoding specific filenames makes an assumption about how lookup_in_store is implemented in the fixes. If this part of the logic changes, the fixes break, and vice versa. Maybe those filenames should be extracted to a configuration file?
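As a sketch of point 4, the store location could be declared once in Catmandu's configuration rather than hardcoded on both sides. The store name and path below are hypothetical:

```yaml
# Hypothetical catmandu.yml fragment: the SQLite store path lives in
# configuration instead of being hardcoded in both the tool and the fixes.
store:
  pids:
    package: DBI
    options:
      data_source: "dbi:SQLite:dbname=/tmp/pids.sqlite"
```

A fix could then reference the named store instead of a literal path, so the database generator and the fixes share a single definition.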
pieterdp commented 7 years ago
  1. The logic is, I think, as generic as possible: PIDS.pm can be used to create an SQLite table from any CSV that is stored on a CloudFiles instance. The museum-specific importers are obviously specific to that use case. We could try to abstract away CloudFiles and accept any locally or remotely stored CSV file.
  2. I feel it's part of the museum-specific import. Not every institution (e.g. ones that use RKD or AAT directly in their CMS) will need them.
  3. They should run before records are processed.
  4. "We don't break userspace" ;-) The location of the temporary tables should, however, be noted in the documentation of the museum-specific module.
pieterdp commented 7 years ago

Fixed in Datahub-Factory-Arthub.