relaton / relaton-bipm

Parse BIPM Metrologia data from rawdata-bipm-metrologia #28

Open ronaldtse opened 2 years ago

ronaldtse commented 2 years ago

As described in https://github.com/relaton/relaton-data-bipm/issues/17 .

This task supersedes #2, which implemented retrieval of Metrologia bibliographic data from IOP but proved unsatisfactory due to remote performance issues.

BIPM has now provided the full bibliographic data set of Metrologia, and we have an agreement in place with IOP Publishing, the publisher. The dataset is now at https://github.com/relaton/rawdata-bipm-metrologia (private access).

The work here is to parse that dataset into the relaton-data-bipm Relaton repository.

(The following information is also provided in README.adoc of the repository but is included here for clarity.)

The full set of bibliographic data comes in a zipped format in the following structure:

2022-04-05T10_55_52_content/         - data archive level
  0026-1394/                         - ISSN of Metrologia (physical version)
    0026-1394_1/                     - Volume 1 of Metrologia
      0026-1394_1_1/                 - Issue 1 of Volume 1
        0026-1394_1_1_1/             - Article 1 of Issue 1 of Volume 1
          metv1i1p1.xml              - Bibliographic data of Article 1

    0026-1394_2/                     - Volume 2 of Metrologia
      0026-1394_2_1/                 - Issue 1 of Volume 2
        0026-1394_2_1_1/             - Article 1 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)
        0026-1394_2_1_6/             - Article 6 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)
        0026-1394_2_1_11/            - Article 11 of Issue 1 of Volume 2 (1, 6, 11 are page numbers)

    0026-1394_59/                    - Volume 59 of Metrologia
      0026-1394_59_1A/               - Issue 1A of Volume 59
        0026-1394_59_1A_01001/       - Article 01001 of Issue 1A of Volume 59
        ...                          
        0026-1394_59_1A_08005/       - Article 08005 of Issue 1A of Volume 59
          0026-1394_59_1A_08005.xml  - Bibliographic data of Article 08005
      0026-1394_59_2/                - Issue 2 of Volume 59
        0026-1394_59_2_022001/       - Article 022001 of Issue 2 of Volume 59
          met_59_2_022001.xml        - Bibliographic data of Article 022001

Subsequent updates will also be provided as archives.

The update archives have the same structure:

2022-06-02T03_01_55_content/         - data archive level
  0026-1394/                         - ISSN of Metrologia (physical version)
    0026-1394_59/                    - Volume 59 of Metrologia
      0026-1394_59_3/                - Issue 3 of Volume 59
        0026-1394_59_3_034002/       - Article 034002 of Issue 3 of Volume 59
          met_59_3_034002.xml        - Bibliographic data of Article 034002

We need to parse this archive into a Relaton dataset.
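
As a minimal sketch of how the article-level directory names listed above could be split into their parts: the pattern and the `parse_article_dir` helper are hypothetical and inferred only from the examples in this issue, not from any official specification. Values are kept as strings because issues such as `1A` and articles such as `08005` are not plain integers.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern, inferred from the directory names shown in this issue:
# <ISSN>_<volume>_<issue>_<article>
ARTICLE_DIR = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])_(?P<volume>[^_]+)_(?P<issue>[^_]+)_(?P<article>[^_]+)$"
)


class ArticleDir(NamedTuple):
    issn: str
    volume: str
    issue: str
    article: str


def parse_article_dir(name: str) -> Optional[ArticleDir]:
    """Split e.g. '0026-1394_59_1A_08005' into ISSN, volume, issue and article."""
    match = ARTICLE_DIR.match(name)
    return ArticleDir(**match.groupdict()) if match else None


print(parse_article_dir("0026-1394_59_1A_08005"))
print(parse_article_dir("0026-1394_59"))  # a volume directory, so no match
```

Volume and issue directories have fewer components and do not match, so the helper returns `None` for them.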

Notice in the folder/file structure:

Contents of metv1i1p1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "JATS-journalpublishing1.dtd">
<article 
  xmlns:mml="http://www.w3.org/1998/Math/MathML" 
  xmlns:xlink="http://www.w3.org/1999/xlink" 
  article-type="editorial" 
  dtd-version="1.1" 
  xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">met</journal-id>
      <journal-title-group>
        <journal-title>Metrologia</journal-title>
        <abbrev-journal-title abbrev-type="IOP">met</abbrev-journal-title>
        <abbrev-journal-title abbrev-type="publisher">Metrologia</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0026-1394</issn>
      <issn pub-type="epub">1681-7575</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">0026-1394__</article-id>
      <article-id pub-id-type="doi">10.1088/0026-1394/1/1/001</article-id>
      <article-id pub-id-type="manuscript">001</article-id>
      <article-categories>
        <subj-group subj-group-type="display-article-type">
          <subject/>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title xml:lang="en">The Role and Policy of  <italic>Metrologia</italic></article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <name>
            <surname>L E Howlett</surname>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1"><label>1</label>Editor, Ottawa, Canada</aff>
      </contrib-group>
      <pub-date pub-type="ppub">
        <day>01</day>
        <month>01</month>
        <year>1965</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <fpage>1</fpage>
      <lpage>1</lpage>
      <permissions>
        <copyright-statement>Published under licence by IOP Publishing Ltd</copyright-statement>
        <copyright-year>1965</copyright-year>
      </permissions>
      <self-uri content-type="pdf" xlink:href="metv1i1p1.pdf" xlink:type="simple"/>
      <abstract xml:lang="en">
        <p>Today it is often said ... After much study ...<italic>Metrologia</italic> ... <italic>Metrologia</italic> ...</p>
        <p>This journal will ...</p>
        <p>Preference will ...</p>
        <p>Review articles will ...</p>
        <p>The journal will ...</p>
        <p>There will be a...</p>
        <p>Ability to measure ...</p>
      </abstract>
    </article-meta>
  </front>
</article>

Contents of 0026-1394_59_1A_08005.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "JATS-journalpublishing1.dtd">
<article 
  xmlns:mml="http://www.w3.org/1998/Math/MathML" 
  xmlns:xlink="http://www.w3.org/1999/xlink" 
  article-type="note" 
  dtd-version="1.1" 
  xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">met</journal-id>
      <journal-id journal-id-type="coden">MTRGAU</journal-id>
      <journal-title-group>
        <journal-title xml:lang="en">Metrologia</journal-title>
        <abbrev-journal-title abbrev-type="IOP" xml:lang="en">MET</abbrev-journal-title>
        <abbrev-journal-title abbrev-type="publisher" xml:lang="en">Metrologia</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0026-1394</issn>
      <issn pub-type="epub">1681-7575</issn>
      <publisher>
        <publisher-name>IOP Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">met_59_1A_08005</article-id>
      <article-id pub-id-type="doi">10.1088/0026-1394/59/1A/08005</article-id>
      <article-id pub-id-type="manuscript">met_59_1A_08005</article-id>
      <article-categories>
        <subj-group subj-group-type="display-article-type">
          <subject>PILOT STUDY</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Final report on pilot study CCQM-P211: carbon isotope delta measurements of vanillin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0003-3398-7246</contrib-id>
          <name name-style="western">
            <surname>Chartrand</surname>
            <given-names>Michelle M G</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-4126-3515</contrib-id>
          <name name-style="western">
            <surname>Kai</surname>
            <given-names>Fuu Ming</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation02">2</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0001-8744-5632</contrib-id>
          <name name-style="western">
            <surname>Meijer</surname>
            <given-names>Harro A J</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation03">3</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0003-4768-2603</contrib-id>
          <name name-style="western">
            <surname>Moossen</surname>
            <given-names>Heiko</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation04">4</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-8339-744X</contrib-id>
          <name name-style="western">
            <surname>Qi</surname>
            <given-names>Haiping</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation05">5</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-7227-791X</contrib-id>
          <name name-style="western">
            <surname>Aerts-Bijma</surname>
            <given-names>Anita T</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation03">3</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Cui</surname>
            <given-names>Yuxi</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation02">2</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Geilmann</surname>
            <given-names>Heike</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation04">4</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-2377-2615</contrib-id>
          <name name-style="western">
            <surname>Mester</surname>
            <given-names>Zoltan</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-3349-5535</contrib-id>
          <name name-style="western">
            <surname>Meija</surname>
            <given-names>Juris</given-names>
          </name>
          <xref ref-type="aff" rid="affiliation01">1</xref>
        </contrib>
        <aff id="affiliation01"><label>1</label>
National Research Council Canada, Metrology, 1200 Montreal Rd., Ottawa, K1A 0R6, Canada</aff>
        <aff id="affiliation02"><label>2</label>
National Metrology Centre, Agency for Science, Technology and Research, 8 Cleantech Loop, 637145, Singapore</aff>
        <aff id="affiliation03"><label>3</label>
Centre for Isotope Research, University of Groningen, Nijenborgh 6, 9747 AG Groningen, The Netherlands</aff>
        <aff id="affiliation04"><label>4</label>
Stable Isotope Laboratory, Max Planck Institute for Biogeochemistry, Hans-Knoell-St. 10, 07745, Jena, Germany</aff>
        <aff id="affiliation05"><label>5</label>
US Geological Survey, Reston, VA 20192, USA</aff>
      </contrib-group>
      <pub-date pub-type="ppub">
        <day>01</day>
        <month>1</month>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>18</day>
        <month>2</month>
        <year>2022</year>
      </pub-date>
      <volume>59</volume>
      <issue>1A</issue>
      <elocation-id content-type="artnum">08005</elocation-id>
      <permissions>
        <copyright-statement>© 2022 BIPM &amp; IOP Publishing Ltd</copyright-statement>
        <copyright-year>2022</copyright-year>
        <license license-type="iop-standard" xlink:href="https://publishingsupport.iopscience.iop.org/iop-standard/v1">
          <license-p>This article is available under the terms of the <ext-link ext-link-type="uri">IOP-Standard License</ext-link>.</license-p>
        </license>
      </permissions>
      <abstract>
        <title>Main text</title>
        <p>This pilot study was ...</p>
        <p>To reach the main text of this paper, click on <ext-link xlink:href="https://www.bipm.org/documents/20126/67196226/CCQM-P211.pdf/03820c42-15b0-6849-3cde-aa6a1a105b42" xlink:type="simple">Final Report</ext-link>.</p>
        <p>The final report has been peer-reviewed and approved for publication by the CCQM.</p>
      </abstract>
    </article-meta>
  </front>
</article>
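
For illustration, the fields in the two samples above could be read with Python's standard `xml.etree.ElementTree`. This is only a sketch: `SAMPLE` is a trimmed copy of the second file, and `text_of`, `extract_metadata` and the output dictionary keys are hypothetical names, not part of relaton-bipm.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on 0026-1394_59_1A_08005.xml above.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="note">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1088/0026-1394/59/1A/08005</article-id>
      <title-group>
        <article-title>Final report on pilot study CCQM-P211: carbon isotope delta measurements of vanillin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Chartrand</surname>
            <given-names>Michelle M G</given-names>
          </name>
        </contrib>
      </contrib-group>
      <volume>59</volume>
      <issue>1A</issue>
      <elocation-id content-type="artnum">08005</elocation-id>
    </article-meta>
  </front>
</article>"""


def text_of(element):
    """Flatten an element's text, including inline children such as <italic>."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split())


def extract_metadata(xml_text):
    """Collect a few bibliographic fields from a JATS article (hypothetical keys)."""
    meta = ET.fromstring(xml_text).find("front/article-meta")
    authors = []
    for name in meta.findall("contrib-group/contrib[@contrib-type='author']/name"):
        # Older files may carry the whole name in <surname> with no <given-names>.
        parts = [text_of(name.find("given-names")), text_of(name.find("surname"))]
        authors.append(" ".join(p for p in parts if p))
    return {
        "title": text_of(meta.find("title-group/article-title")),
        "doi": text_of(meta.find("article-id[@pub-id-type='doi']")),
        "volume": text_of(meta.find("volume")),
        "issue": text_of(meta.find("issue")),
        "authors": authors,
    }


print(extract_metadata(SAMPLE))
```

`text_of` uses `itertext()` so that titles with inline markup, like the `<italic>Metrologia</italic>` in the first sample, come out as plain text.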

ronaldtse commented 1 year ago

We need to action this issue ASAP due to BIPM request.

The corresponding data sync work has been done by @CAMOBAP at:

andrew2net commented 1 year ago

@ronaldtse there are two date types in the source:

      <pub-date pub-type="ppub">
        <day>01</day>
        <month>1</month>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>18</day>
        <month>2</month>
        <year>2022</year>
      </pub-date>

Nick's suggestion is to treat the "epub" date as a relation:

<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-02-18</date>
    <medium><carrier>online resource</carrier></medium>
  </bibitem>
</relation>

ronaldtse commented 1 year ago

This is a good point.

We should take the earlier of the ppub and epub dates as the date of publication.
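
A rough sketch of that rule in Python, using the two `pub-date` elements quoted above. The helper names `pub_dates` and `publication_date` are hypothetical, and defaulting a missing day or month to 1 is an assumption of this sketch, not something stated by the data provider.

```python
import xml.etree.ElementTree as ET
from datetime import date

SAMPLE = """<article-meta>
  <pub-date pub-type="ppub"><day>01</day><month>1</month><year>2022</year></pub-date>
  <pub-date pub-type="epub"><day>18</day><month>2</month><year>2022</year></pub-date>
</article-meta>"""


def pub_dates(article_meta):
    """Map each pub-type ('ppub', 'epub') to a date."""
    dates = {}
    for node in article_meta.findall("pub-date"):
        year = int(node.findtext("year"))
        # Assumption: a missing or empty month/day falls back to 1.
        month = int(node.findtext("month") or 1)
        day = int(node.findtext("day") or 1)
        dates[node.get("pub-type")] = date(year, month, day)
    return dates


def publication_date(article_meta):
    """Earliest of the available pub-date values."""
    return min(pub_dates(article_meta).values())


meta = ET.fromstring(SAMPLE)
print(pub_dates(meta))
print(publication_date(meta).isoformat())
```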

I think that even the original "ppub" (which stands for "print publication", according to JATS) should also be encoded as a new manifestation:

<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-01-01</date>
    <medium><carrier>print</carrier></medium>
  </bibitem>
</relation>
<relation type="hasManifestation">
  <bibitem>
    <title>(same)</title>
    <date>2022-02-18</date>
    <medium><carrier>traditional</carrier></medium>
  </bibitem>
</relation>

andrew2net commented 1 year ago

@ronaldtse it seems the data source doesn't provide URLs.

ronaldtse commented 1 year ago

Then we don't need to provide a URL. We do have DOIs, so that is sufficient.

andrew2net commented 1 year ago

@ronaldtse yes, we do have DOIs for articles. But we also need to create issue documents with article relations, volume documents with issue relations, and root "Metrologia" documents with volume relations. Can we have these documents without URLs?

ronaldtse commented 1 year ago

I think so for the moment. Let me ask BIPM/IOPP to provide URLs for these entries.

ronaldtse commented 1 year ago

I have asked BIPM for URLs. For the moment, let's continue without URLs and file a ticket to keep track.

ronaldtse commented 1 year ago

BIPM's Janet Miles says we should use the DOI as the URL for articles. For volumes and issues, there are no DOIs.

Let's use these URLs instead:

andrew2net commented 1 year ago

@ronaldtse the source file rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml is missing its page (article) number. Its title is "Index of Contributors", so it should have page 68: https://iopscience.iop.org/article/10.1088/0026-1394/37/5/68. Is this BIPM's mistake?

UPD: the same applies to rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml (https://iopscience.iop.org/article/10.1088/0026-1394/40/1/001).

ronaldtse commented 1 year ago

@andrew2net have you re-pulled from this repo? The data path is different now.

I can see in the first file: data/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml

        <article-id pub-id-type="manuscript">
          68
        </article-id>
        <title-group>
          <article-title xml:lang="en">
            Index of Contributors
          </article-title>
        </title-group>

The number 68 is present.

In the second file: data/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml

        <article-id pub-id-type="manuscript">
          001
        </article-id>
        <title-group>
          <article-title xml:lang="en">
            Editorial
          </article-title>
        </title-group>

The 001 is also present.

andrew2net commented 1 year ago

@ronaldtse indeed, you are right about these documents. However, most documents have an fpage element. For example,

rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_29/0026-1394_29_6/0026-1394_29_6_373/metv29i6p373.xml

has both <fpage>373</fpage> and <article-id pub-id-type="manuscript">001</article-id>. So it seems that if fpage is present we should use it as the article number, and otherwise use article-id[@pub-id-type="manuscript"]. Am I right?
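
The rule proposed in this comment could be sketched as follows. `article_number` is a hypothetical helper, and the two small XML fragments only mimic the elements quoted in this thread. Whitespace is stripped because some source files wrap the value in newlines.

```python
import xml.etree.ElementTree as ET


def article_number(article_meta):
    """Return <fpage> if present, otherwise the 'manuscript' article-id."""
    for path in ("fpage", "article-id[@pub-id-type='manuscript']"):
        value = (article_meta.findtext(path) or "").strip()
        if value:
            return value
    return None


with_fpage = ET.fromstring(
    "<article-meta>"
    "<article-id pub-id-type='manuscript'>001</article-id>"
    "<fpage>373</fpage>"
    "</article-meta>"
)
without_fpage = ET.fromstring(
    "<article-meta>"
    "<article-id pub-id-type='manuscript'>\n  68\n</article-id>"
    "</article-meta>"
)
print(article_number(with_fpage))     # 373
print(article_number(without_fpage))  # 68
```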

ronaldtse commented 1 year ago

It seems so. What a strange encoding.

ronaldtse commented 1 year ago

Can you document this strange behavior in the README? Thanks.

andrew2net commented 1 year ago

@ronaldtse if we use fpage as the article number, we get duplicate document IDs. So I currently use article-id[@pub-id-type="manuscript"], but this means the article numbers are different now.

andrew2net commented 1 year ago

@ronaldtse here are duplicates in the source dataset:

"Metrologia 59 1A 06011"
rawdata-bipm-metrologia/data/2022-05-28T03_01_55_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
rawdata-bipm-metrologia/data/2022-06-29T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
"Metrologia 59 4 ac7687"
rawdata-bipm-metrologia/data/2022-07-07T03_01_47_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
rawdata-bipm-metrologia/data/2022-10-15T03_01_48_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
"Metrologia 59 1A 08013"
rawdata-bipm-metrologia/data/2022-09-03T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
rawdata-bipm-metrologia/data/2022-09-14T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
"Metrologia 59 6 ac98cb"
rawdata-bipm-metrologia/data/2022-10-29T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-17T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-24T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
"Metrologia 59 1A 07020"
rawdata-bipm-metrologia/data/2022-11-18T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
rawdata-bipm-metrologia/data/2022-11-26T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
"Metrologia 60 1A 01001"
rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml

ronaldtse commented 1 year ago

@andrew2net sorry to get back late here. For these source duplications:

  1. Are the contents identical?
  2. If not, do we need to merge them?
  3. Can we just take the newest copy? (if the newer ones are corrections)

Thanks!

andrew2net commented 9 months ago

@ronaldtse

@andrew2net sorry to get back late here. For these source duplications:

  1. Are the contents identical?

In cases 1, 5, and 6 the documents differ in their contributors: one document has extra contributors.

In case 2 the documents look identical, except that one of them has a back element after front:

  ...
  </front>
  <back>
    <ref-list content-type="numerical">
      <title>References</title>
      <ref id="metac7687bib1">
        <label>1</label>
        <element-citation publication-type="journal" xlink:type="simple">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Petit</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Jiang</surname>
              <given-names>Z</given-names>
            </name>
          </person-group>
          <year>2008</year>
          <source>Int. J. Navig. Obs.</source>
          <volume>2008</volume>
          <fpage>1</fpage>
          <lpage>8</lpage>
          <page-range>1–8</page-range>
          <pub-id pub-id-type="doi">10.1155/2008/562878</pub-id>
        </element-citation>
      </ref>
      <ref id="metac7687bib2">
        <label>2</label>
       ...

It looks like relations. Shouldn't we parse them?

In cases 3 and 4 the documents look identical.

  2. If not, do we need to merge them?

I think we should merge them.

  3. Can we just take the newest copy? (if the newer ones are corrections)

In these cases the dates are identical.

ronaldtse commented 4 weeks ago

@andrew2net sorry for the late reply. ref-list is a list of bibliographic references.

In Relaton, our data model should also support external bibliographic references. So yes, these are "relations" (since a reference is a kind of relation).
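
As a sketch only, the entries of a `ref-list` could be collected like this before being mapped to relations. `cited_references` and its dictionary keys are hypothetical, the sample reproduces the single reference quoted earlier in this thread, and which Relaton relation type to use is not decided here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <back>
    <ref-list content-type="numerical">
      <title>References</title>
      <ref id="metac7687bib1">
        <label>1</label>
        <element-citation publication-type="journal" xlink:type="simple">
          <person-group person-group-type="author">
            <name name-style="western"><surname>Petit</surname><given-names>G</given-names></name>
            <name name-style="western"><surname>Jiang</surname><given-names>Z</given-names></name>
          </person-group>
          <year>2008</year>
          <source>Int. J. Navig. Obs.</source>
          <volume>2008</volume>
          <fpage>1</fpage>
          <lpage>8</lpage>
          <pub-id pub-id-type="doi">10.1155/2008/562878</pub-id>
        </element-citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def cited_references(article):
    """Return one plain dict per <ref> that carries an <element-citation>."""
    refs = []
    for ref in article.findall("back/ref-list/ref"):
        citation = ref.find("element-citation")
        if citation is None:
            continue
        refs.append({
            "id": ref.get("id"),
            "authors": [
                f"{n.findtext('given-names', '')} {n.findtext('surname', '')}".strip()
                for n in citation.findall("person-group/name")
            ],
            "year": citation.findtext("year"),
            "source": citation.findtext("source"),
            "doi": citation.findtext("pub-id[@pub-id-type='doi']"),
        })
    return refs


print(cited_references(ET.fromstring(SAMPLE)))
```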

For de-duplication purposes, we should use the archive directory names in the file paths, which indicate the date each record was created, i.e.

"Metrologia 60 1A 01001"
rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml

The first item has the date 2023-01-05T03_01_46 and the second has 2023-01-06T03_01_49. So we consider the "record entry dates" of the two items to be 2023-01-05T03:01:46 and 2023-01-06T03:01:49 respectively. We take the newer item, i.e. the second one.
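
A minimal sketch of this de-duplication, assuming every path contains a `<timestamp>_content/` component as in the examples above. The regular expression and the helper names `split_path` and `newest_per_article` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: .../<YYYY-MM-DDTHH_MM_SS>_content/<path inside the archive>
ARCHIVE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2})_content/(.+)$")


def split_path(path):
    """Return (record entry date, path inside the archive)."""
    match = ARCHIVE.search(path)
    if match is None:
        raise ValueError(f"no archive timestamp in {path!r}")
    stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H_%M_%S")
    return stamp, match.group(2)


def newest_per_article(paths):
    """For each path inside the archive, keep the copy from the newest archive."""
    newest = {}
    for path in paths:
        stamp, key = split_path(path)
        if key not in newest or stamp > newest[key][0]:
            newest[key] = (stamp, path)
    return [path for _, path in newest.values()]


paths = [
    "rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml",
    "rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml",
]
print(newest_per_article(paths))
```

Grouping by the path inside the archive treats two files as the same record only when their volume, issue, article directory and file name all match.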