sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Author identifiers are not unique #546

Open atz opened 6 years ago

atz commented 6 years ago

We rely on sunetid, university_id, and california_physician_license being unique (e.g. in cap_profile_id_rewriter), but in our production data they are not:

Author.select(:id, :sunetid).group(:sunetid).having("count(*) > 1").to_a.count
=> 3893
Author.select(:id, :university_id).group(:university_id).having("count(*) > 1").to_a.count
=> 4315
Author.select(:id, :california_physician_license).group(:california_physician_license).having("count(*) > 1").to_a.count
=> 1556

I thought this might be attributable to duplicate profiles that were flagged inactive, but that does not seem to be entirely the case:

Author.select(:id, :sunetid).where(active_in_cap: true).group(:sunetid).having("count(*) > 1").to_a.count
=> 1116
Author.select(:id, :sunetid).where(active_in_cap: false).group(:sunetid).having("count(*) > 1").to_a.count
=> 127
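
For illustration, here is a minimal plain-Ruby sketch (not run against production) of why the active and inactive counts need not add up to the overall count: a value shared by one active and one inactive record is a duplicate overall, but neither filtered query sees it. It also skips blank identifiers, because SQL GROUP BY puts all NULL values into a single group, so the counts above could include one group of records that simply have no identifier. Whether production has such rows is not established in this thread. `AuthorRow`, `duplicated_values`, and the sample rows are hypothetical stand-ins, not code or data from sul_pub.

```ruby
# Hypothetical stand-in for the Author model; only the fields used here.
AuthorRow = Struct.new(:id, :sunetid, :active_in_cap)

# Returns the identifier values shared by more than one row.
# Blank values are skipped so that "no identifier" is not reported
# as one large duplicate group.
def duplicated_values(rows, key)
  counts = Hash.new(0)
  rows.each do |row|
    value = row[key]
    next if value.nil? || value.to_s.strip.empty?
    counts[value] += 1
  end
  counts.select { |_value, count| count > 1 }.keys
end

# Made-up sample data covering each case.
rows = [
  AuthorRow.new(1, 'alice', true),
  AuthorRow.new(2, 'alice', true),   # duplicated among active rows
  AuthorRow.new(3, 'bob',   true),
  AuthorRow.new(4, 'bob',   false),  # duplicated only across active/inactive
  AuthorRow.new(5, 'carol', false),
  AuthorRow.new(6, 'carol', false),  # duplicated among inactive rows
  AuthorRow.new(7, nil,     true),
  AuthorRow.new(8, nil,     false)   # missing identifier, ignored
]

active, inactive = rows.partition(&:active_in_cap)

overall_dupes  = duplicated_values(rows, :sunetid)
active_dupes   = duplicated_values(active, :sunetid)
inactive_dupes = duplicated_values(inactive, :sunetid)

puts "overall: #{overall_dupes.size}, active: #{active_dupes.size}, inactive: #{inactive_dupes.size}"
```

In this sample, 'bob' is a duplicate overall but appears in neither the active-only nor the inactive-only result, which is the same shape as the gap between 3893 and 1116 + 127.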

We have code whose logic depends on these identifiers being unique, and the production data breaks that assumption. We need to either fix the data or rework/abandon the code.

atz commented 6 years ago

In today's standup we decided to defer any further updates to the rewriter (including any provisions for cap_profile_id uniqueness) until we actually need to run it again.

The existing refactor and cleanup will be merged, including a raise that effectively blocks use of the class.
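
As a rough sketch of what such a blocking raise could look like (the class name, error name, and placement in `initialize` are assumptions for illustration, not the code being merged):

```ruby
# Hypothetical sketch; the class and error names are illustrative,
# not the ones used in sul_pub.
class UnsafeRewriterError < StandardError; end

class ProfileIdRewriterSketch
  def initialize
    # Fail fast so nobody runs the rewriter against non-unique identifiers.
    raise UnsafeRewriterError,
          'Disabled: author identifiers are not unique in production data (see #546)'
  end

  def rewrite_all
    # Unreachable while the guard in initialize is in place.
  end
end

begin
  ProfileIdRewriterSketch.new
rescue UnsafeRewriterError => e
  warn "refused: #{e.message}"
end
```

Raising in the constructor keeps the refactored code in the tree while making accidental use fail loudly; removing the guard then becomes a deliberate step once the uniqueness question is settled.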