Closed ronaldtse closed 1 month ago
The fetched value is as follows (BIPM Metrologia 56 2 022001):
<name>
<forename language="en" script="Latn" initial="J">Martin</forename>
<forename initial="T"/>
<surname language="en" script="Latn">Milton</surname>
</name>
Ping @andrew2net
forename/@initial is the canonical initial of that forename, not another one. I use it if present, but J does not stand for Martin. So if the name is Martin J Milton, I would expect to see
<forename>Martin</forename>
<forename initial="J"/>
<surname>Milton</surname>
or at most
<forename initial="M">Martin</forename> # but by default I use the first letter anyway
<forename initial="J"/>
<formatted-initials>M</formatted-initials> # do not use M J
<surname>Milton</surname>
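The expected structure above can be sketched in Python. This is only an illustration, not the relaton parser's code: the function name `build_name` and the rule that a one-letter token is an initial while any longer token is a full forename are assumptions.

```python
import xml.etree.ElementTree as ET


def build_name(given_names: str, surname: str) -> ET.Element:
    """Build a <name> element from a given-names string and a surname.

    Assumption: a token of one letter (optionally followed by a period)
    is an initial; any longer token is a full forename.
    """
    name = ET.Element("name")
    for token in given_names.split():
        forename = ET.SubElement(name, "forename")
        stripped = token.rstrip(".")
        if len(stripped) == 1:
            # Initial only: store it in the attribute, leave the element empty.
            forename.set("initial", stripped)
        else:
            # Full forename: no initial attribute, so a consumer can fall
            # back to the first letter of the text.
            forename.text = token
    ET.SubElement(name, "surname").text = surname
    return name


name_xml = ET.tostring(build_name("Martin J T", "Milton"), encoding="unicode")
print(name_xml)
```

With this rule, "Martin J T" yields three `forename` elements: one carrying the text "Martin" and two empty ones carrying `initial="J"` and `initial="T"`, so no initial is attached to a forename it does not belong to.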
@ronaldtse I've checked the source dataset. It has inconsistencies in person names.
I don't see a straightforward way to correctly parse all the cases. Wouldn't it be a good idea to just concatenate the given name and surname and save them as a full name string?
@andrew2net for "source dataset" did you mean the IOP Metrologia XML, which has that article like this?
The source XML is:
<contrib contrib-type="author" xlink:type="simple">
<contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-8174-2211</contrib-id>
<name name-style="western">
<surname>Milton</surname>
<given-names>Martin J T</given-names>
</name>
<xref ref-type="aff" rid="affiliation01">1</xref>
</contrib>
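For reference, a minimal Python sketch of reading the fields out of a fragment like this one. The helper name `read_name` and the returned dictionary keys are hypothetical, and the `xmlns:xlink` declaration is added here only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# The fragment from the thread, with the xlink namespace declared so that
# it is well-formed as a standalone document.
CONTRIB = """
<contrib xmlns:xlink="http://www.w3.org/1999/xlink"
         contrib-type="author" xlink:type="simple">
  <contrib-id authenticated="false" contrib-id-type="orcid">0000-0002-8174-2211</contrib-id>
  <name name-style="western">
    <surname>Milton</surname>
    <given-names>Martin J T</given-names>
  </name>
  <xref ref-type="aff" rid="affiliation01">1</xref>
</contrib>
"""


def read_name(contrib: ET.Element) -> dict:
    """Return the raw name fields of one <contrib> element, unparsed."""
    name = contrib.find("name")
    return {
        "surname": (name.findtext("surname") or "").strip(),
        "given_names": (name.findtext("given-names") or "").strip(),
        "name_style": name.get("name-style"),
        "orcid": contrib.findtext("contrib-id[@contrib-id-type='orcid']"),
    }


author = read_name(ET.fromstring(CONTRIB))
print(author)
```

The point of the sketch is that the source keeps "Martin J T" as a single `given-names` string; any split into forenames and initials happens later, in the parser.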
So this means that there is a problem with parsing source XML -- is it this issue?
@ronaldtse I've updated the parser to solve this issue. This case is not a big problem. To be sure that the update won't cause other problems, I've checked what names there are in the rawdata-bipm-metrologia dataset. It revealed that there are many other problems with name consistency in the dataset. So the question is: should I update the parser to solve only this issue, or is it better not to parse the name parts and to save them as full names?
The relaton/relaton-bipm#2 issue is about document duplication. I'll reply to your comment in that issue.
So this means that there is a problem with parsing source XML -- is it this issue?
@andrew2net
I've checked what names there are in the rawdata-bipm-metrologia dataset. It revealed that there are many other problems with name consistency in the dataset.
Can you point out what the inconsistencies are?
@MStock78120 and @jmilesBIPM would be interested in finding out the issues with the bibliographic encodings at Metrologia.
Thanks.
Some examples are here, in the format "given name", "surname", file:
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_43/0026-1394_43_5/0026-1394_43_5_426/met6_5_014.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_47/0026-1394_47_1A/0026-1394_47_1A_08005/0026-1394_47_1A_08005.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_49/0026-1394_49_1A/0026-1394_49_1A_08001/0026-1394_49_1A_08001.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_57/0026-1394_57_6/0026-1394_57_6_065032/met_57_6_065032.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_49/0026-1394_49_6/0026-1394_49_6_702/met12_6_702.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_54/0026-1394_54_1A/0026-1394_54_1A_08020/0026-1394_54_1A_08020.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_55/0026-1394_55_1A/0026-1394_55_1A_08018/0026-1394_55_1A_08018.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_1/0026-1394_1_4/0026-1394_1_4_158/metv1i4p158.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_14/0026-1394_14_4/0026-1394_14_4_179/metv14i4p179.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_36/0026-1394_36_6/0026-1394_36_6_599/me9623.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1A/0026-1394_40_1A_04003/0026-1394_40_1A_04003.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_42/0026-1394_42_1A/0026-1394_42_1A_08003/0026-1394_42_1A_08003.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_44/0026-1394_44_1/0026-1394_44_1_1/met_44_1_1.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_47/0026-1394_47_1A/0026-1394_47_1A_08013/0026-1394_47_1A_08013.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_51/0026-1394_51_1A/0026-1394_51_1A_08002/0026-1394_51_1A_08002.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_51/0026-1394_51_1A/0026-1394_51_1A_08002/0026-1394_51_1A_08002.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_4/0026-1394_4_3/0026-1394_4_3_147/metv4i3p147.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_44/0026-1394_44_1A/0026-1394_44_1A_08001/0026-1394_44_1A_08001.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_44/0026-1394_44_1A/0026-1394_44_1A_08001/0026-1394_44_1A_08001.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_44/0026-1394_44_1A/0026-1394_44_1A_08001/0026-1394_44_1A_08001.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_28/0026-1394_28_3/0026-1394_28_3_183/metv28i3p183.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_28/0026-1394_28_3/0026-1394_28_3_183/metv28i3p183.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_17/0026-1394_17_3/0026-1394_17_3_81/metv17i3p81.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_10/0026-1394_10_3/0026-1394_10_3_99/metv10i3p99.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_40/0026-1394_40_1A/0026-1394_40_1A_09001/0026-1394_40_1A_09001.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_51/0026-1394_51_1A/0026-1394_51_1A_06021/0026-1394_51_1A_06021.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_46/0026-1394_46_3/0026-1394_46_3_315/met_46_3_315.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_47/0026-1394_47_1A/0026-1394_47_1A_04009/0026-1394_47_1A_04009.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_47/0026-1394_47_1A/0026-1394_47_1A_09003/0026-1394_47_1A_09003.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_48/0026-1394_48_1A/0026-1394_48_1A_08015/0026-1394_48_1A_08015.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_50/0026-1394_50_1A/0026-1394_50_1A_08013/0026-1394_50_1A_08013.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_51/0026-1394_51_1A/0026-1394_51_1A_08017/0026-1394_51_1A_08017.xml
data/2022-04-05T10_55_52_content/0026-1394/0026-1394_51/0026-1394_51_1A/0026-1394_51_1A_08016/0026-1394_51_1A_08016.xml
data/2023-01-27T03_01_52_content/0026-1394/0026-1394_60/0026-1394_60_1A/0026-1394_60_1A_08006/0026-1394_60_1A_08006.xml
You can see that name parts can be all in the surname or distributed between surname and given name unpredictably. Affixes can be capitalized or not. Even if we managed to make rules to parse forename, surname, initials, prefixes, and additions, I doubt that we'd be able to restore the original name from the parts. I think if we use the given-name + surname string as the full name, it will give us the original name. We have a fullname element in our data model to keep the string.
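A minimal sketch of the concatenation proposed here, in Python. The function name `full_name` is hypothetical, and the whitespace clean-up is an assumption about what the source strings may contain.

```python
def full_name(given_names, surname):
    """Join given names and surname into one display string.

    Either part may be empty or None, which covers records where the
    whole name was entered in a single field.
    """
    parts = []
    for part in (given_names, surname):
        if part:
            # Collapse runs of whitespace left over from the source XML.
            cleaned = " ".join(part.split())
            if cleaned:
                parts.append(cleaned)
    return " ".join(parts)


print(full_name("Martin J T", "Milton"))  # Martin J T Milton
```

Because nothing is split or reordered, capitalization of affixes and the distribution of parts between the two fields are preserved as the publisher supplied them.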
FYI, since this issue we use the branch 2023-04-23.
@jmilesBIPM it seems that Metrologia is the culprit -- can we ask them to update this data to correct the names?
The online version of the article in question (https://doi.org/10.1088/1681-7575/ab0013) displays the full name (given name plus surname) of all the authors correctly: e.g. "Martin J T Milton"
Isn't it possible for you just to display these two fields as provided in the XML files?
I see in the first message from @anermina that the XML gave
<name>
<forename language="en" script="Latn" initial="J">Martin</forename>
<forename initial="T"/>
<surname language="en" script="Latn">Milton</surname>
</name>
This is admittedly more long-winded than one might expect, but the answer seems to be
forename + forename initial given with forename + forename initial given separately + surname
= Martin J T Milton
@ronaldtse why don't we just save "given_name" + "surname" as a "fullname"? I don't think it's possible to parse all the names correctly.
In that case we'd have "Martin T Milton" here? Close enough!
@jmilesBIPM in the source we have
<name name-style="western">
<surname>Milton</surname>
<given-names>Martin J T</given-names>
</name>
So "given-name" + "surname" will be "Martin J T Milton".
@ronaldtse I just noticed that there is a "name-style" attribute. Maybe we can use it to parse name parts correctly.
There is only the "western" name style. Names are now created by concatenating "given-name" and "surname". The name is now "Martin J T Milton".
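One way such a check of "name-style" values could be done is sketched below in Python. This is not necessarily how the dataset was checked: the function name `count_name_styles` is hypothetical, and the two sample documents are synthetic stand-ins for the article files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_name_styles(xml_documents):
    """Tally the name-style attribute over all <name> elements."""
    counts = Counter()
    for doc in xml_documents:
        root = ET.fromstring(doc)
        for name in root.iter("name"):
            # Names without the attribute are counted under a placeholder key.
            counts[name.get("name-style", "(missing)")] += 1
    return counts


# Synthetic samples, not taken from the dataset.
samples = [
    '<article><name name-style="western"><surname>Milton</surname></name></article>',
    '<article><name name-style="western"><surname>A</surname></name>'
    "<name><surname>B</surname></name></article>",
]
styles = count_name_styles(samples)
print(styles)
```

If every name in the dataset reports the same style, as found here, the attribute gives the parser nothing to distinguish the inconsistent cases by, which supports keeping the concatenated string.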
From Michael Stock:
The very last reference (no. 112 in the English text, no. 111 in the French text) has a typo: it should be Milton M, not Milton J.