Open ronaldtse opened 2 years ago
We need to action this issue ASAP due to a BIPM request.
The corresponding data sync work has been done by @CAMOBAP at:
@ronaldtse there are two date types in the source:
<pub-date pub-type="ppub">
<day>01</day>
<month>1</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="epub">
<day>18</day>
<month>2</month>
<year>2022</year>
</pub-date>
Nick's suggestion is treating "epub" type as relation:
<relation type="hasManifestation">
<bibitem>
<title>(same)</title>
<date>2022-02-18</date>
<medium><carrier>online resource</carrier></medium>
</bibitem>
</relation>
This is a good point.
We should take the earliest of the ppub and epub dates as the date of publication.
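A minimal sketch of that rule, assuming the pub-date elements are read with Python's standard xml.etree.ElementTree. The names parse_pub_date and earliest_pub_date and the wrapping article-meta element in the sample are illustrative only, not taken from the actual fetcher:

```python
import datetime
import xml.etree.ElementTree as ET

# Sample built from the two pub-date elements quoted above;
# the <article-meta> wrapper is an assumption for illustration.
SAMPLE = """<article-meta>
  <pub-date pub-type="ppub"><day>01</day><month>1</month><year>2022</year></pub-date>
  <pub-date pub-type="epub"><day>18</day><month>2</month><year>2022</year></pub-date>
</article-meta>"""

def parse_pub_date(elem):
    """Turn one <pub-date> into a date; a missing day or month defaults to 1."""
    year = int(elem.findtext("year"))
    month = int(elem.findtext("month") or 1)
    day = int(elem.findtext("day") or 1)
    return datetime.date(year, month, day)

def earliest_pub_date(meta):
    """Return the earliest date among all <pub-date> descendants, or None."""
    dates = [parse_pub_date(e) for e in meta.iter("pub-date")]
    return min(dates) if dates else None

meta = ET.fromstring(SAMPLE)
print(earliest_pub_date(meta).isoformat())
```

With the sample above, the ppub date is the earlier of the two, so it would become the publication date.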
I think that even the original "ppub" (which stands for "print publication", according to JATS) should also be encoded as a new manifestation:
<relation type="hasManifestation">
<bibitem>
<title>(same)</title>
<date>2022-01-01</date>
<medium><carrier>print</carrier></medium>
</bibitem>
</relation>
<relation type="hasManifestation">
<bibitem>
<title>(same)</title>
<date>2022-02-18</date>
<medium><carrier>traditional</carrier></medium>
</bibitem>
</relation>
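As a rough sketch of how one manifestation per pub-date could be generated, assuming a simple mapping from JATS pub-type to carrier (ppub to "print" and epub to "online resource", as in the first two examples in this thread). The helper names and the mapping table are hypothetical, and the output is the simplified XML used in this discussion rather than full Relaton XML:

```python
import xml.etree.ElementTree as ET

# Assumed mapping, based only on the examples quoted in this thread.
CARRIER_BY_PUB_TYPE = {"ppub": "print", "epub": "online resource"}

def manifestation(title, iso_date, carrier):
    """Build one <relation type="hasManifestation"> element."""
    relation = ET.Element("relation", {"type": "hasManifestation"})
    bibitem = ET.SubElement(relation, "bibitem")
    ET.SubElement(bibitem, "title").text = title
    ET.SubElement(bibitem, "date").text = iso_date
    medium = ET.SubElement(bibitem, "medium")
    ET.SubElement(medium, "carrier").text = carrier
    return relation

def manifestations(title, pub_dates):
    """pub_dates is an iterable of (pub_type, iso_date) pairs; unknown types are skipped."""
    return [
        manifestation(title, iso_date, CARRIER_BY_PUB_TYPE[pub_type])
        for pub_type, iso_date in pub_dates
        if pub_type in CARRIER_BY_PUB_TYPE
    ]

rels = manifestations("(same)", [("ppub", "2022-01-01"), ("epub", "2022-02-18")])
for rel in rels:
    print(ET.tostring(rel, encoding="unicode"))
```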
@ronaldtse it seems the data source doesn't provide URLs.
Then we don't need to provide a URL. We do have DOIs, so that is sufficient.
@ronaldtse yes, we do have DOIs for articles. But we also need to create issue documents with article relations, volume documents with issue relations, and root "Metrologia" documents with volume relations. Can we have these documents without URLs?
I think so for the moment. Let me ask BIPM/IOPP to provide URLs for these entries.
I have asked BIPM for URLs. For the moment, let's continue without URLs and file a ticket to keep track.
BIPM's Janet Miles says we should use the DOI as the URL for articles. For volumes and issues, there are no DOIs.
Let's use these URLs instead:
@ronaldtse the source file rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml is missing its page (article) number. It has the title "Index of Contributors", so it should be page 68: https://iopscience.iop.org/article/10.1088/0026-1394/37/5/68. Is this BIPM's mistake?
Update: the same applies to rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml https://iopscience.iop.org/article/10.1088/0026-1394/40/1/001
@andrew2net have you re-pulled from this repo? The data path is different now.
I can see in the first file:
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_37/0026-1394_37_5/0026-1394_37_5_68/me0568.xml
<article-id pub-id-type="manuscript">
68
</article-id>
<title-group>
<article-title xml:lang="en">
Index of Contributors
</article-title>
</title-group>
The number 68 is present.
In the second file:
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1/0026-1394_40_1_001/0026-1394_40_1_001.xml
<article-id pub-id-type="manuscript">
001
</article-id>
<title-group>
<article-title xml:lang="en">
Editorial
</article-title>
</title-group>
The number 001 is also present.
@ronaldtse indeed, you are right about these documents, but most documents have an fpage element. For example,
rawdata-bipm-metrologia/2022-04-05T10_55_52_content/0026-1394/0026-1394_29/0026-1394_29_6/0026-1394_29_6_373/metv29i6p373.xml
has <fpage>373</fpage> and <article-id pub-id-type="manuscript">001</article-id>. So it seems that if there is an fpage we should use it as the article number, and otherwise use article-id[@pub-id-type="manuscript"]. Am I right?
It seems so. What a strange encoding.
Can you document this strange behavior in the README? Thanks.
@ronaldtse if we use fpage as the article number, we get duplicate document IDs. So I currently use article-id[@pub-id-type="manuscript"], but this means the article numbers are now different.
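The two candidate sources for the article number can be compared with a small sketch like the one below. It assumes Python's xml.etree.ElementTree; the function names, the prefer_fpage flag, and the article-meta wrapper in the sample are hypothetical and only illustrate the choice discussed here:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample modelled on metv29i6p373.xml as quoted above; the wrapper element is assumed.
SAMPLE = """<article-meta>
  <article-id pub-id-type="manuscript">
    001
  </article-id>
  <fpage>373</fpage>
</article-meta>"""

def manuscript_id(meta):
    """Text of article-id[@pub-id-type="manuscript"], or None if absent or empty."""
    for elem in meta.iter("article-id"):
        if elem.get("pub-id-type") == "manuscript":
            return (elem.text or "").strip() or None
    return None

def first_page(meta):
    """Text of the first fpage element, or None if absent or empty."""
    text = meta.findtext(".//fpage")
    return text.strip() if text and text.strip() else None

def article_number(meta, prefer_fpage=False):
    """Try the preferred source first and fall back to the other one."""
    order = (first_page, manuscript_id) if prefer_fpage else (manuscript_id, first_page)
    for pick in order:
        value = pick(meta)
        if value:
            return value
    return None

def duplicated(ids):
    """Return the IDs that occur more than once, to spot document ID clashes."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)

meta = ET.fromstring(SAMPLE)
print(article_number(meta), article_number(meta, prefer_fpage=True))
```

Running duplicated over the IDs generated with each strategy would show which one produces clashes in a given dataset.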
@ronaldtse here are duplicates in the source dataset:
"Metrologia 59 1A 06011"
rawdata-bipm-metrologia/data/2022-05-28T03_01_55_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
rawdata-bipm-metrologia/data/2022-06-29T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_06011/0026-1394_59_1A_06011.xml
"Metrologia 59 4 ac7687"
rawdata-bipm-metrologia/data/2022-07-07T03_01_47_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
rawdata-bipm-metrologia/data/2022-10-15T03_01_48_content/0026-1394/0026-1394_59/0026-1394_59_4/0026-1394_59_4_045007/met_59_4_045007.xml
"Metrologia 59 1A 08013"
rawdata-bipm-metrologia/data/2022-09-03T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
rawdata-bipm-metrologia/data/2022-09-14T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_08013/0026-1394_59_1A_08013.xml
"Metrologia 59 6 ac98cb"
rawdata-bipm-metrologia/data/2022-10-29T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-17T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
rawdata-bipm-metrologia/data/2022-11-24T03_01_45_content/0026-1394/0026-1394_59/0026-1394_59_6/0026-1394_59_6_064001/met_59_6_064001.xml
"Metrologia 59 1A 07020"
rawdata-bipm-metrologia/data/2022-11-18T03_01_53_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
rawdata-bipm-metrologia/data/2022-11-26T03_01_46_content/0026-1394/0026-1394_59/0026-1394_59_1A/0026-1394_59_1A_07020/0026-1394_59_1A_07020.xml
"Metrologia 60 1A 01001"
rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
@andrew2net sorry to get back late here. For these source duplications:
Thanks!
@ronaldtse
@andrew2net sorry to get back late here. For these source duplications:
- Are the contents identical?
In cases 1, 5, and 6 the documents differ in their contributors: one document has extra contributors.
In case 2 the documents look identical, except that one of them has a back element after front:
...
</front>
<back>
<ref-list content-type="numerical">
<title>References</title>
<ref id="metac7687bib1">
<label>1</label>
<element-citation publication-type="journal" xlink:type="simple">
<person-group person-group-type="author">
<name name-style="western">
<surname>Petit</surname>
<given-names>G</given-names>
</name>
<name name-style="western">
<surname>Jiang</surname>
<given-names>Z</given-names>
</name>
</person-group>
<year>2008</year>
<source>Int. J. Navig. Obs.</source>
<volume>2008</volume>
<fpage>1</fpage>
<lpage>8</lpage>
<page-range>1–8</page-range>
<pub-id pub-id-type="doi">10.1155/2008/562878</pub-id>
</element-citation>
</ref>
<ref id="metac7687bib2">
<label>2</label>
...
It looks like relations. Shouldn't we parse these as relations?
In cases 3 and 4 the documents look identical.
- If not, do we need to merge them?
I think we should merge them
- Can we just take the newest copy? (if the newer ones are corrections)
In these cases dates are identical.
@andrew2net sorry for the late reply. ref-list is a list of bibliographic references.
In Relaton, our data model should also support external bibliographic references. So yes these are "relations" (since a reference is a kind of relation).
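A minimal sketch of reading such a ref-list, assuming Python's xml.etree.ElementTree. The function name parse_refs and the dictionary keys are hypothetical; the sample is cut down from the excerpt quoted earlier, with the xlink:type attribute dropped so that it parses without a namespace declaration. How each entry maps onto a Relaton relation is left open here:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<back>
  <ref-list content-type="numerical">
    <title>References</title>
    <ref id="metac7687bib1">
      <label>1</label>
      <element-citation publication-type="journal">
        <person-group person-group-type="author">
          <name name-style="western"><surname>Petit</surname><given-names>G</given-names></name>
          <name name-style="western"><surname>Jiang</surname><given-names>Z</given-names></name>
        </person-group>
        <year>2008</year>
        <source>Int. J. Navig. Obs.</source>
        <pub-id pub-id-type="doi">10.1155/2008/562878</pub-id>
      </element-citation>
    </ref>
  </ref-list>
</back>"""

def parse_refs(back):
    """Collect one plain dict per <ref> that contains an <element-citation>."""
    refs = []
    for ref in back.iter("ref"):
        citation = ref.find("element-citation")
        if citation is None:
            continue
        authors = [
            f"{name.findtext('surname', '')} {name.findtext('given-names', '')}".strip()
            for name in citation.iter("name")
        ]
        doi = next(
            (p.text for p in citation.iter("pub-id") if p.get("pub-id-type") == "doi"),
            None,
        )
        refs.append({
            "id": ref.get("id"),
            "authors": authors,
            "year": citation.findtext("year"),
            "source": citation.findtext("source"),
            "doi": doi,
        })
    return refs

for entry in parse_refs(ET.fromstring(SAMPLE)):
    print(entry)
```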
For de-duplication purposes, we should use the timestamps in the file paths, which indicate the date each record was created, e.g.
"Metrologia 60 1A 01001"
rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml
The first item has the date 2023-01-05T03_01_46 and the second has 2023-01-06T03_01_49, so we consider the "record entry dates" of the items to be 2023-01-05T03:01:46 and 2023-01-06T03:01:49 respectively. We take the newer item, i.e. the second one.
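A minimal sketch of this de-duplication rule in Python, assuming every path contains a snapshot directory named like 2023-01-05T03_01_46_content and that two files are copies of the same record when the rest of the path after that directory is identical. The names split_path and newest_copies are illustrative only:

```python
import re
from datetime import datetime

# Snapshot directory such as "2023-01-05T03_01_46_content", followed by the record path.
SNAPSHOT_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2})_content/(.+)$")

def split_path(path):
    """Return (record entry date, path relative to the snapshot directory)."""
    match = SNAPSHOT_RE.search(path)
    if match is None:
        raise ValueError(f"no snapshot timestamp in {path!r}")
    stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H_%M_%S")
    return stamp, match.group(2)

def newest_copies(paths):
    """Keep only the newest copy of each record, keyed by its relative path."""
    newest = {}
    for path in paths:
        stamp, key = split_path(path)
        if key not in newest or stamp > newest[key][0]:
            newest[key] = (stamp, path)
    return [path for _, path in newest.values()]

PATHS = [
    "rawdata-bipm-metrologia/data/2023-01-05T03_01_46_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml",
    "rawdata-bipm-metrologia/data/2023-01-06T03_01_49_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_01001/0026-1394_60_1A_01001.xml",
]
print(newest_copies(PATHS))
```

For the two paths above, only the 2023-01-06 copy is kept.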
As described in https://github.com/relaton/relaton-data-bipm/issues/17 .
This task supersedes #2 which implemented support to retrieve Metrologia bibliographic data from IOP but was unsatisfactory due to remote performance issues.
BIPM has now provided the full bibliographic data set of Metrologia, and we have an agreement in place with IOP Publishing, the publisher. The dataset is now at https://github.com/relaton/rawdata-bipm-metrologia (private access).
The work here is to parse that dataset into the relaton-data-bipm Relaton repository.
(The following information is also provided in README.adoc of the repository but included here for clarity)
The full set of bibliographic data comes in a zipped format in the following structure:
Subsequent updates will also be provided in the archived format.
The update archives have the same structure:
We need to parse this archive into a Relaton dataset.
Notice in the folder/file structure:
Contents of metv1i1p1.xml:
Contents of 0026-1394_59_1A_08005.xml: