sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Periodically update WoS records #1594

Open peetucket opened 1 year ago

peetucket commented 1 year ago

See also the related issues #87 and #179.

Web of Science periodically fixes data problems, for example, typos and problems in identifiers, such as DOIs. Since we never update our source data once harvested, those typos and problems with DOIs will remain forever. This results, for example, in broken DOI links (of which we have ~1400 as of April 2023).

Since we reuse previously harvested publication data where possible when adding the same publication to a new author's profile, the publication will still carry the old cached data even when a new author harvests it.

This task would involve periodically pulling updated WoS data for all our publications and refreshing our cached data (source records and pub-hash), or at least some portion of it (such as just the identifiers).
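As a rough illustration of the "just the identifiers" option, here is a minimal Python sketch (not code from this repository). It assumes the cached pub-hash and the freshly pulled WoS data can both be treated as flat dictionaries; the function name `refresh_identifiers` and the keys in `IDENTIFIER_KEYS` are hypothetical, and the real pub-hash layout may differ.

```python
from __future__ import annotations

from typing import Any

# Hypothetical identifier keys; the real pub-hash layout may differ.
IDENTIFIER_KEYS = ("doi", "pmid")


def refresh_identifiers(
    cached: dict[str, Any], fresh: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Copy changed identifier values from `fresh` into `cached` in place.

    Returns a mapping of key -> (old_value, new_value) for each identifier
    that changed. Non-identifier fields are left untouched.
    """
    changes: dict[str, tuple[Any, Any]] = {}
    for key in IDENTIFIER_KEYS:
        new_value = fresh.get(key)
        # Skip identifiers missing upstream so a blank never overwrites cached data.
        if new_value is None:
            continue
        old_value = cached.get(key)
        if old_value != new_value:
            changes[key] = (old_value, new_value)
            cached[key] = new_value
    return changes


# Made-up records: the cached DOI is truncated and the upstream copy is fixed.
cached = {"title": "An exmaple", "doi": "10.1000/xyz12"}
fresh = {"title": "An example", "doi": "10.1000/xyz123"}
changes = refresh_identifiers(cached, fresh)
```

In this sketch only the DOI is updated; the title stays as cached even though the upstream copy differs, which keeps the visible change on a profile small. The returned `changes` mapping could also be logged to see how many records a refresh would touch.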

Note that this could have side effects, as we would be changing data for publications already harvested, which could:

  1. change how they appear on users' profiles, even after they have been approved
  2. cause larger-than-expected nightly change updates when the Profiles API updates publications via our API
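One possible way to limit the second side effect, offered only as a sketch and not as a decided approach, is to cap how many refreshed publications are written per run so that each nightly update stays bounded. The helper name `batches` and the `max_per_run` parameter below are hypothetical.

```python
from __future__ import annotations

from typing import Iterator, Sequence


def batches(changed_ids: Sequence[str], max_per_run: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `changed_ids`, each at most `max_per_run` long."""
    if max_per_run < 1:
        raise ValueError("max_per_run must be at least 1")
    for start in range(0, len(changed_ids), max_per_run):
        yield list(changed_ids[start:start + max_per_run])


# Made-up publication ids; each inner list would be applied in a separate run.
plan = list(batches(["p1", "p2", "p3", "p4", "p5"], max_per_run=2))
```

Spreading the writes over several runs trades a slower refresh for smaller nightly change sets.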

Note that some data problems are likely never fixed in Web of Science source records, and this work would thus have no impact on records with persistent bad metadata.