Find and group duplicate publications as each new publication is imported

EricDurante commented 4 years ago

Currently, we import all of the new publication metadata from each data source, then we find and group duplicate publications, and then we hide any duplicated publications. This has become problematic for two reasons:

For the period of time between when the new publication metadata is imported and when the new duplicates are detected and hidden, duplicate publication metadata is visible to end users.
Because these three steps each operate on all of the publication metadata in bulk, it is much harder to fully automate a scheduled import of new publication metadata from each data source.

Both problems would be solved by modifying each publication importer so that it detects and hides duplicates of each new publication record as that record is imported.

This will require updating the following importers:

activity_insight_importer.rb
pure_publication_importer.rb
web_of_science_file_importer.rb
oai_importer.rb
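
As a rough illustration of the per-record approach, here is a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the `Publication` struct, the `PerRecordImporter` class, and the title-plus-year matching rule are hypothetical and are not taken from the actual importers or models in this repository.

```ruby
# Hypothetical in-memory stand-in for a publication record.
# The real importers work against the application's own models.
Publication = Struct.new(:title, :year, :source, :visible, :group_id, keyword_init: true)

# Hypothetical importer that groups and hides duplicates as each record arrives,
# instead of running a separate bulk pass after the import finishes.
class PerRecordImporter
  attr_reader :publications

  def initialize
    @publications = []
    @next_group_id = 1
  end

  # Import one record. If it duplicates a record that is already stored,
  # put both in the same group and hide the new one before it is saved,
  # so the duplicate is never visible to end users.
  def import(title:, year:, source:)
    pub = Publication.new(title: title, year: year, source: source, visible: true)
    match = @publications.find { |existing| duplicate?(existing, pub) }

    if match
      match.group_id ||= next_group_id
      pub.group_id = match.group_id
      pub.visible = false
    end

    @publications << pub
    pub
  end

  private

  # Assumed matching rule: same year, and same title once case and
  # punctuation are ignored. The real matching criteria may differ.
  def duplicate?(a, b)
    a.year == b.year && normalize(a.title) == normalize(b.title)
  end

  def normalize(title)
    title.downcase.gsub(/[^a-z0-9]+/, ' ').strip
  end

  def next_group_id
    id = @next_group_id
    @next_group_id += 1
    id
  end
end
```

With this shape, importing the same title from two sources leaves the first record visible and stores the second one already hidden and grouped with the first, so no separate bulk step is needed after the import.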

DanCoughlin commented 3 years ago

Done for the Activity Insight import and the law school import.

EricDurante commented 3 years ago

This is done for Pure now as well. However, we'll still need to make this change for the Web of Science import if we're going to import any more new data from files: https://github.com/psu-stewardship/researcher-metadata/issues/149

psu-libraries / researcher-metadata

Find and group duplicate publications as each new publication is imported #90