inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

hal: extract from the response the HAL identifier of the duplicate #2731

Open jacquerie opened 7 years ago

jacquerie commented 7 years ago

To avoid crawling HAL to find out which HAL records correspond to which INSPIRE records, we can parse the response HAL sends when we try to create a record that already exists there, and then update the existing record on their system instead.

For example, let's consider https://hal.archives-ouvertes.fr/hal-01584710, which is the HAL push of https://inspirehep.net/record/1519372. When we run

>>> from inspirehep.utils.record_getter import get_db_record
>>> from inspirehep.modules.hal.core.tei import convert_to_tei
>>> from inspirehep.modules.hal.core.sword import create
>>> record = get_db_record('lit', 1519372)
>>> tei = convert_to_tei(record)
>>> create(tei.encode('utf8'))

we get back a sword2.exceptions.HTTPResponseError whose content attribute contains

<?xml version="1.0" encoding="utf-8"?>
<sword:error xmlns:sword="http://purl.org/net/sword/error/" xmlns="http://www.w3.org/2005/Atom" href="http://purl.org/net/sword/error/ErrorBadRequest">
  <title>ERROR</title>
  <updated>2017-09-10T19:16:57+02:00</updated>
  <author>
    <name>HAL SWORD API Server</name>
  </author>
  <source>
    <generator uri="https://api.archives-ouvertes.fr/sword" version="1.0">hal@ccsd.cnrs.fr</generator>
  </source>
  <summary>Some parameters sent with the request were not understood</summary>
  <sword:treatment>processing failed</sword:treatment>
  <sword:verboseDescription>{"duplicate-entry":{"hal-01584710":{"arxiv":"1.0","doi":"1.0","inspire":"1.0"}}}</sword:verboseDescription>
  <link rel="alternate" href="https://api.archives-ouvertes.fr" type="text/html"/>
</sword:error>

which should be easy enough to parse (although wrapping JSON errors in XML is... perplexing).
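
As a rough sketch of what that parsing could look like (not tested against the live HAL API): read the sword:verboseDescription element with the standard library and decode its text as JSON. The function name extract_duplicate_hal_ids is made up for illustration, and the code assumes that the error body always has the shape shown above, with the HAL identifiers as the keys of "duplicate-entry".

```python
import json
import xml.etree.ElementTree as ET

# Namespace bound to the "sword" prefix in the error document shown above.
SWORD_ERROR_NS = {'sword': 'http://purl.org/net/sword/error/'}


def extract_duplicate_hal_ids(content):
    """Return the HAL identifiers listed under "duplicate-entry".

    Hypothetical helper: ``content`` is the body of the SWORD error
    response. Returns an empty list when no duplicate is reported.
    """
    # Feed bytes to the parser so the XML encoding declaration is honoured.
    if not isinstance(content, bytes):
        content = content.encode('utf-8')

    root = ET.fromstring(content)
    description = root.find('sword:verboseDescription', SWORD_ERROR_NS)
    if description is None or not description.text:
        return []

    # Assumption: the verbose description is JSON only for some errors,
    # so anything that does not decode to a dict is treated as "no duplicate".
    try:
        payload = json.loads(description.text)
    except ValueError:
        return []
    if not isinstance(payload, dict):
        return []

    duplicates = payload.get('duplicate-entry')
    if not isinstance(duplicates, dict):
        return []
    return sorted(duplicates)
```

With the response above this would be called on the content attribute of the caught exception, e.g. extract_duplicate_hal_ids(error.content), and the returned identifier could then be used to update the existing HAL record instead of creating a new one.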

jacquerie commented 7 years ago

CC: @kaplun @mathieugrives @StellaCh

The moral of the story is that we don't need to implement or update any synchronization mechanism on legacy, since we can recover this information at push time (and write it in the record when labs is master).