inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

parsers: add crossref api parser #237

Closed vbalbp closed 6 years ago

vbalbp commented 6 years ago

Signed-off-by: Victor Balbuena <vbalbp@gmail.com>

vbalbp commented 6 years ago

The XML result has the following additional information compared to the JSON result:

        <crm-item name="publisher-name" type="string">Institute of Electrical and Electronics Engineers (IEEE)</crm-item>
        <crm-item name="prefix-name" type="string">Institute of Electrical and Electronics Engineers</crm-item>
        <crm-item name="member-id" type="number">263</crm-item>
        <crm-item name="citation-id" type="number">94433734</crm-item>
        <crm-item name="journal-id" type="number">5435</crm-item>
        <crm-item name="deposit-timestamp" type="number">20180212160014172</crm-item>
        <crm-item name="owner-prefix" type="string">10.1109</crm-item>
        <crm-item name="last-update" type="date">2018-02-12T22:42:58Z</crm-item>
        <crm-item name="created" type="date">2017-11-23T19:07:49Z</crm-item>
        <crm-item name="citedby-count" type="number">1</crm-item>

However, the 'prefix-name', 'member-id' and 'citedby-count' fields are also in the JSON result.

On the other hand, the JSON result has the field "references-count":13, which gives the number of references in the record and is not found in the XML result (although you can always count the references yourself).
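
As a rough illustration of that last point, here is a minimal sketch of reading the count from the JSON result and falling back to counting the entries when the field is absent. The sample payload is made up and trimmed down, and the `message` wrapper and the `reference` list key are assumptions about the response layout; `count_references` is a hypothetical helper, not something in this PR.

```python
import json

# Hypothetical, trimmed-down response: only the keys used below are present.
# The "message" wrapper and the "reference" list are assumed, not taken from this PR.
SAMPLE = """
{"message": {"references-count": 2,
             "reference": [{"key": "ref1"}, {"key": "ref2"}]}}
"""


def count_references(message):
    """Return the number of references, preferring the explicit field."""
    count = message.get("references-count")
    if count is None:
        # No explicit field: count the entries, as one would for the XML result.
        count = len(message.get("reference", []))
    return count


message = json.loads(SAMPLE)["message"]
print(count_references(message))
```

The fallback branch is the same counting one would have to do for the XML result, so the explicit field is a convenience rather than extra information.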

Apart from that, the remaining fields map one-to-one between the two formats, and neither format contains any other extra information. For that reason, I chose to parse the JSON result instead of the XML one: it is easier to parse and closer to our current JSON format in labs.
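
To make the "easier to parse" point concrete, here is a small sketch comparing the two approaches on a few of the fields listed above. It assumes a made-up `<crm-items>` wrapper element and a flat JSON object whose keys mirror the XML names. The real responses are nested differently and may use XML namespaces, so treat every name here other than `crm-item` and its attributes as hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around a few of the crm-item entries shown above.
XML_SAMPLE = """
<crm-items>
  <crm-item name="member-id" type="number">263</crm-item>
  <crm-item name="owner-prefix" type="string">10.1109</crm-item>
  <crm-item name="citedby-count" type="number">1</crm-item>
</crm-items>
"""

# Hypothetical flat JSON carrying the same values (key names are assumptions).
JSON_SAMPLE = '{"member-id": 263, "owner-prefix": "10.1109", "citedby-count": 1}'


def parse_crm_items(xml_text):
    """Collect crm-item elements into a dict, converting by the 'type' attribute."""
    root = ET.fromstring(xml_text.strip())
    converters = {"number": int}  # every other type is kept as a string
    fields = {}
    for item in root.iter("crm-item"):
        convert = converters.get(item.get("type"), str)
        fields[item.get("name")] = convert(item.text)
    return fields


xml_fields = parse_crm_items(XML_SAMPLE)
json_fields = json.loads(JSON_SAMPLE)  # values arrive already typed
print(xml_fields == json_fields)
```

The XML path needs an explicit type-conversion step driven by the `type` attribute, while `json.loads` returns typed values directly. That is the practical difference behind the choice.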