inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

arXiv dubs #3575

Open ksachs opened 6 years ago

ksachs commented 6 years ago

Trying to trace why arXiv records are created twice.

Instead of an update, a new record is created

arXiv:1807.07025, 1683196, 1683259
arXiv:1807.06513, 1682949, 1682955

E.g.

001682955 541__ $$aarXiv$$chepcrawl$$d2018-07-18T03:36:57.313380$$e1132375  
Record added 2018-07-18, last modified 2018-07-18

001682949 541__ $$aarXiv$$chepcrawl$$d2018-07-19T03:36:40.280688$$e1134788    
Record added 2018-07-18, last modified 2018-07-19 
The "added" date is not true: this record was created on 2018-07-19.
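To compare the harvest timestamp (`$$d`) and holdingpen workflow id (`$$e`) across suspected duplicates, the 541 lines quoted above can be split into subfields. This is a minimal sketch, assuming the text-MARC layout shown in this issue (recid, tag with indicators, `$$`-prefixed subfields); `parse_marc_line` is a hypothetical helper, not something from inspire-next.

```python
import re
from datetime import datetime


def parse_marc_line(line):
    """Split a text-MARC line into recid, tag and a subfield dict.

    Hypothetical helper for illustration; it assumes the layout quoted
    in this issue and does not handle repeated subfield codes.
    """
    recid, tag, rest = line.strip().split(None, 2)
    # Each subfield is '$$' + a one-character code + its value.
    subfields = dict(re.findall(r"\$\$(\w)([^$]*)", rest))
    return {"recid": int(recid), "tag": tag, "subfields": subfields}


lines = [
    "001682955 541__ $$aarXiv$$chepcrawl$$d2018-07-18T03:36:57.313380$$e1132375",
    "001682949 541__ $$aarXiv$$chepcrawl$$d2018-07-19T03:36:40.280688$$e1134788",
]

for parsed in map(parse_marc_line, lines):
    sub = parsed["subfields"]
    harvested = datetime.strptime(sub["d"], "%Y-%m-%dT%H:%M:%S.%f")
    # Print recid, workflow id and harvest date side by side.
    print(parsed["recid"], sub["e"], harvested.date())
```

For the two lines above this prints `1682955 1132375 2018-07-18` and `1682949 1134788 2018-07-19`, which makes it easy to set the harvest date against the "Record added" date.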

The second workflow, 1134788, is correctly identified as an exact match. But instead of replacing the existing record, a new record is created. The creation date of this new record is inherited from the old record.

https://labs.inspirehep.net/api/holdingpen/1134788 contains:

"callback_result": {
  "marcxml": "<record>\n  <controlfield tag=\"001\">1682949</controlfield

For some reason a new recid is added to

/opt/cds-invenio/var/tmp-shared/batchupload_20180719050932_D9f4T8
<controlfield tag="001">1682949</controlfield>

Where is this new recid coming from? Can it overwrite something?

Update comes in while first record is halted

For these I'm not sure I understand the info in the API:
arXiv:1807.10190, 1684265, 1684268
arXiv:1807.09872, 1684269, 1684274
arXiv:1807.10163, 1684266, 1684273

What I believe happens, e.g.

001684265 541__ $$aarXiv$$chepcrawl$$d2018-07-27T03:35:47.430949$$e1147082  
001684268 541__ $$aarXiv$$chepcrawl$$d2018-07-28T03:35:35.671772$$e1148488   

The first is halted for match approval. While it is halted, the second comes in. Now they also somehow match each other. But both upload files contain no controlnumber, each creating a new record.

https://labs.inspirehep.net/api/holdingpen/1147082
"exact-matched": true, 
"fuzzy_match_approved_id": null, 
"holdingpen_matches": [
  1148488
], 
"is-update": true, 
"matches": {
  "approved": 1684265, 
  "exact": [
    1684265, 
    1684268
  ], 
  "fuzzy": [
      "control_number": 1665833, 
"marcxml": "<record>\n  <controlfield tag=\"001\">1684265</controlfield>

https://labs.inspirehep.net/api/holdingpen/1148488
"exact-matched": true, 
"fuzzy_match_approved_id": null, 
"holdingpen_matches": [
  1147082
], 
"is-update": true, 
"matches": {
  "approved": 1684265, 
  "exact": [
    1684265, 
    1684268
  ], 
  "fuzzy": [
      "control_number": 1665833, 
"marcxml": "<record>\n  <controlfield tag=\"001\">1684265</controlfield>\n
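
The two excerpts can be cross-checked mechanically for the patterns described here: workflows that list each other in `holdingpen_matches`, and more than one recid under `matches.exact`. This is a minimal sketch using only values copied from the excerpts above. The flat dict layout is an assumption, since the excerpts do not show where these keys are nested, and `suspicious` is a hypothetical helper.

```python
# Values copied from the two API excerpts; the flat layout is an assumption.
workflows = {
    1147082: {
        "is-update": True,
        "holdingpen_matches": [1148488],
        "matches": {"approved": 1684265, "exact": [1684265, 1684268]},
    },
    1148488: {
        "is-update": True,
        "holdingpen_matches": [1147082],
        "matches": {"approved": 1684265, "exact": [1684265, 1684268]},
    },
}


def suspicious(workflows):
    """Yield (workflow_id, reason) for the patterns described in this issue."""
    for wf_id, data in workflows.items():
        exact = data["matches"]["exact"]
        if len(exact) > 1:
            yield wf_id, "more than one exact match: %s" % exact
        for other in data["holdingpen_matches"]:
            # A mutual match means the other workflow points back at this one.
            if wf_id in workflows.get(other, {}).get("holdingpen_matches", []):
                yield wf_id, "mutual holdingpen match with %d" % other


for wf_id, reason in suspicious(workflows):
    print(wf_id, reason)
```

Both workflows are flagged on both counts, even though each claims `"is-update": true` with the same approved recid.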
ksachs commented 6 years ago

Correction: all upload files I checked do have a controlnumber, i.e. a recid.

ksachs commented 6 years ago

Each pair below was created from the same workflow, one right after the other (successive recids). All workflows have an error message in extra_data. Maybe the double upload was triggered by a restart?

arXiv:1808.01257, 1685054, 1685055
001685055 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371
001685054 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371

arXiv:1808.01365, 1685234, 1685235
001685235 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286
001685234 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286

arXiv:1808.01473, 1685232, 1685233
001685233 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331
001685232 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331
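
Pairs like these, where two recids carry the same `$$e` workflow id, can be found by grouping 541 lines on `$$e`. This is a minimal sketch using only the lines quoted in this comment; the regex assumes the text-MARC layout shown here, and `duplicates_by_workflow` is a hypothetical helper.

```python
import re
from collections import defaultdict

# recid, then the 541 tag (with or without indicators), then the $$e subfield.
LINE_RE = re.compile(r"^0*(\d+)\s+541\S*\s+.*\$\$e(\d+)\s*$")


def duplicates_by_workflow(lines):
    """Map workflow id ($$e) to sorted recids, keeping ids seen on >1 record."""
    by_wf = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            recid, wf_id = int(match.group(1)), int(match.group(2))
            by_wf[wf_id].add(recid)
    return {wf: sorted(recids) for wf, recids in by_wf.items() if len(recids) > 1}


lines = [
    "001685055 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371",
    "001685054 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371",
    "001685235 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286",
    "001685234 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286",
    "001685233 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331",
    "001685232 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331",
]

for wf_id, recids in sorted(duplicates_by_workflow(lines).items()):
    print(wf_id, recids)
```

Run over a larger dump of 541 fields, the same grouping would list every workflow that produced more than one record.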

ksachs commented 6 years ago

Another update that came in while the first record was halted. Somehow the order of actions might not be right. The second workflow (claims to) stop the first only after being halted for match approval. The first workflow continues anyhow, is again stopped for matching, and in the end reaches send_to_legacy.

001688926 037__ $$9arXiv$$aarXiv:1808.05450$$chep-ph
001688926 541__ $$aarXiv$$chepcrawl$$d2018-08-17T03:35:14.401921$$e1177634

001688751 541__ $$aarXiv$$chepcrawl$$d2018-08-18T03:35:02.440396$$e1179088

WorkFlow:1177634
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-17 03:53:26.347351"
  }, 
  {
    "nicename": "Mark the workflow object with stopped-by-wf:1179088.", 
    "time": "2018-08-20 15:00:45.534012"
  }, 
  ....
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-20 15:01:08.732753"
  }, 
  {
    "doc": "IF_ELSE: args(<function is_fuzzy_match_approved at 0x7f22106d0ed8>, ....
    "time": "2018-08-21 07:32:05.250600"
  }, 
  ....
  {
    "nicename": "send_to_legacy", 
    "time": "2018-08-21 07:32:56.497283"
  }, 
"holdingpen_matches": [
  1179088
], 

WorkFlow:1179088
  {
    "nicename": "Mark the workflow object with already-in-holding-pen:True.", 
    "time": "2018-08-18 03:42:35.836757"
  }, 
  ....
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-18 03:42:36.372337"
  }, 
  {
    "nicename": "Stop the matched workflow objects in the holdingpen.", 
    "time": "2018-08-20 15:00:45.712279"
  }, 
  ....
  {
    "nicename": "send_to_legacy", 
    "time": "2018-08-20 15:03:52.980712"
  }, 
  ....
  {
    "nicename": "Mark the workflow object with stopped-by-wf:1177634.", 
    "time": "2018-08-21 07:32:05.442135"
  }, 
"holdingpen_matches": [
  1177634
],
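
To see the order of actions across both workflows, the two logs can be merged and sorted by timestamp. This is a minimal sketch using only the entries quoted above (elided steps are left out, and the action labels are shortened by hand); `merged_timeline` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (workflow id, time, shortened action) copied from the two logs above.
events = [
    (1177634, "2018-08-17 03:53:26.347351", "halted for matching approval"),
    (1177634, "2018-08-20 15:00:45.534012", "marked stopped-by-wf:1179088"),
    (1177634, "2018-08-20 15:01:08.732753", "halted for matching approval"),
    (1177634, "2018-08-21 07:32:05.250600", "IF_ELSE is_fuzzy_match_approved"),
    (1177634, "2018-08-21 07:32:56.497283", "send_to_legacy"),
    (1179088, "2018-08-18 03:42:35.836757", "marked already-in-holding-pen:True"),
    (1179088, "2018-08-18 03:42:36.372337", "halted for matching approval"),
    (1179088, "2018-08-20 15:00:45.712279", "stop matched workflow objects"),
    (1179088, "2018-08-20 15:03:52.980712", "send_to_legacy"),
    (1179088, "2018-08-21 07:32:05.442135", "marked stopped-by-wf:1177634"),
]


def merged_timeline(events):
    """Return all events from both workflows in chronological order."""
    return sorted(events, key=lambda e: datetime.strptime(e[1], FMT))


for wf_id, time, action in merged_timeline(events):
    print(time, wf_id, action)
```

Sorted this way, both workflows reach send_to_legacy: 1179088 on 2018-08-20 and 1177634 on 2018-08-21, even though each was marked as stopped by the other.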