sul-dlss / sul_pub

SUL system for harvest and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu
Other
8 stars 3 forks source link

Duplicate keys errors in web of science source records #831

Closed peetucket closed 6 years ago

peetucket commented 6 years ago

Errors like this keep showing up in the logs during harvesting. We should understand why to see if any mitigation is necessary:


E, [2018-04-18T05:07:47.834058 #21220] ERROR -- : WebOfScience::ProcessRecords: Author: 92207, ProcessRecords failed
E, [2018-04-18T05:07:47.834111 #21220] ERROR -- : #<ActiveRecord::RecordNotUnique: Mysql2::Error: Duplicate entry '351586' for key 'index_web_of_science_source_records_on_publication_id': UPDATE `web_of_science_source_records` SET `publication_id` = 351586, `updated_at` = '2018-04-18 12:07:46' WHERE `web_of_science_source_records`.`id` = 9509>
E, [2018-04-18T05:07:47.834172 #21220] ERROR -- : /opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/mysql2-0.4.10/lib/mysql2/client.rb:120:in `_query'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/mysql2-0.4.10/lib/mysql2/client.rb:120:in `block in query'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/mysql2-0.4.10/lib/mysql2/client.rb:119:in `handle_interrupt'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/mysql2-0.4.10/lib/mysql2/client.rb:119:in `query'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_mysql_adapter.rb:317:in `block in execute'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_adapter.rb:484:in `block in log'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_adapter.rb:478:in `log'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_mysql_adapter.rb:317:in `execute'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/mysql2_adapter.rb:217:in `execute'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/mysql2_adapter.rb:238:in `exec_delete'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:114:in `update'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/query_cache.rb:14:in `update'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/relation.rb:88:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/persistence.rb:515:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/locking/optimistic.rb:79:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/attribute_methods/dirty.rb:129:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/callbacks.rb:310:in `block in _update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:88:in `__run_callbacks__'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_update_callbacks'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/callbacks.rb:310:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/timestamp.rb:70:in `_update_record'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/persistence.rb:504:in `create_or_update'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/callbacks.rb:302:in `block in create_or_update'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:117:in `call'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:555:in `block (2 levels) in compile'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:505:in `call'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:92:in `__run_callbacks__'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_save_callbacks'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/callbacks.rb:302:in `create_or_update'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/persistence.rb:142:in `save!'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/validations.rb:43:in `save!'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/attribute_methods/dirty.rb:29:in `save!'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:291:in `block in save!'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:351:in `block in with_transaction_returning_status'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:211:in `transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:220:in `transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:348:in `with_transaction_returning_status'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:291:in `save!'
/opt/app/pub/sul-pub/releases/20180413162902/app/models/web_of_science_source_record.rb:24:in `block in link_publication'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `block in transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/transaction.rb:184:in `within_new_transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:220:in `transaction'
/opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:277:in `transaction'
/opt/app/pub/sul-pub/releases/20180413162902/app/models/web_of_science_source_record.rb:22:in `link_publication'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/process_records.rb:60:in `block in create_publications'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/process_records.rb:55:in `each'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/process_records.rb:55:in `create_publications'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/process_records.rb:31:in `execute'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:96:in `process_records'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:49:in `process_uids'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:30:in `process_author'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:15:in `block in harvest'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:14:in `each'
/opt/app/pub/sul-pub/releases/20180413162902/lib/web_of_science/harvester.rb:14:in `harvest'
/opt/app/pub/sul-pub/releases/20180413162902/lib/cap/authors_poller.rb:61:in `block in do_wos_harvest'```
peetucket commented 6 years ago

Another example:

E, [2018-04-19T10:37:22.796314 #14635] ERROR -- : #<ActiveRecord::RecordNotUnique: Mysql2::Error: Duplicate entry '450125' for key 'index_web_of_science_source_records_on_publication_id': UPDATE `web_of_science_source_records` SET `publication_id` = 450125, `updated_at` = '2018-04-19 17:37:22' WHERE `web_of_science_source_records`.`id` = 2262>
E, [2018-04-19T10:37:22.796535 #14635] ERROR -- : /opt/app/pub/sul-pub/shared/bundle/ruby/2.3.0/gems/mysql2-0.4.10/lib/mysql2/client.rb:120:in `_query'
atz commented 6 years ago

So in the latter example, the two WOS source records competing to represent the same pub are:

Same DOI, but different UID so different fingerprint. Upstream provider is treating these as separate records.

The pub being linked isn't actually WoS provenance: it's a pubmed record from 30 Aug 2017. So we really only need at most one WoS record to link to it (which we have).

So I interpret this as a question of duplicate detection or just handling the collision error.

peetucket commented 6 years ago

Ok, so the WoS API returns this record twice as you note, but with different UIDs (since it comes from two different databases on their end). Ultimately as long as we don't duplicate the publication row itself, the fact that we have two different WOS Source Records is not a big deal (one is just orphaned). So perhaps just a guard clause around the linking logic in the "link_publication" method of WebOfScienceSourceRecord?