thedatahub / Datahub-Factory

Datahub::Factory - Transport metadata between Collection Management Systems and the Datahub

Generic SQLite database generation #5

Closed netsensei closed 7 years ago

netsensei commented 7 years ago

The code contains logic to create SQLite databases from CSV files. These databases are used by Catmandu's lookup_in_store() fix to add extra data to records.
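To make the idea concrete, here is a minimal sketch of turning a CSV file into an SQLite lookup table. This is not the project's actual (Perl) implementation; the function name, table name, and column handling are assumptions for illustration only.

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path, table="lookup"):
    """Load a CSV file into an SQLite table, one column per CSV header field.

    Hypothetical helper -- the real tool's logic lives in Perl and is
    currently hardwired to specific museum file names.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # first row supplies the column names
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
        con.commit()
        con.close()
```

A fix like lookup_in_store() can then query the resulting table by key to enrich records.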

The logic currently in the tool is hardwired to a specific museum use case: it contains hardcoded file names, among other things.

There are several things we need to consider here:

  1. Can we make the logic generic, i.e. build it once and reuse it for multiple institutions and for any case where a lookup_in_store requires input?
  2. Should this be a separate command that should be run up front?
  3. Maybe this could be part of a "preprocess" step which happens in a separate, isolated, loosely coupled part of the logic before we start processing the records themselves?
  4. Hardcoding specific filenames makes an assumption about how lookup_in_store is implemented in the fixes. If this part of the logic changes, the fixes break, and vice versa. Maybe those filenames should be extracted to a configuration file?
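As a sketch of point 4, the store location could be declared once in Catmandu's configuration rather than hardcoded on both sides. The store name and path below are hypothetical:

```yaml
# Hypothetical catmandu.yml fragment: the SQLite store path lives in
# configuration instead of being hardcoded in both the tool and the fixes.
store:
  pids:
    package: DBI
    options:
      data_source: "dbi:SQLite:dbname=/tmp/pids.sqlite"
```

A fix could then reference the named store instead of a literal path, so the database generator and the fixes share a single definition.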
pieterdp commented 7 years ago
  1. The logic is, I think, as generic as possible: PIDS.pm can be used to create an SQLite table from any CSV that is stored on a CloudFiles instance. The museum-specific importers are obviously specific to that use case. We could try to abstract away CloudFiles and accept any locally or remotely stored CSV file.
  2. I feel it's part of the museum-specific import. Not every institution (e.g. ones that use RKD or AAT directly in their CMS) will need them.
  3. They should run before records are processed.
  4. "We don't break userspace" ;-) The location of the temporary tables should, however, be noted in the documentation of the museum-specific module.
pieterdp commented 7 years ago

Fixed in Datahub-Factory-Arthub.