relaton / relaton-data-bipm

Relaton bibliographic data for BIPM

Parse BIPM Metrologia data from rawdata-bipm-metrologia #17

Open ronaldtse opened 2 years ago

ronaldtse commented 2 years ago

Metrologia is a BIPM journal published by IOP Publishing. Its bibliographic data is in JATS XML.

We may want to take this opportunity to create a converter from JATS bibliographic items to Relaton (i.e. we need to do it right, e.g. not dropping namespaces during parsing).

Please see https://github.com/relaton/relaton-bipm/issues/28 for more details on the formats.

andrew2net commented 1 year ago

@ronaldtse I have no idea how to distinguish organization names from parts of addresses in the affiliations. For example, 0026-1394_59_1A_08005.xml contains these affiliations:

National Research Council Canada, Metrology, 1200 Montreal Rd., Ottawa, K1A 0R6, Canada
National Metrology Centre, Agency for Science, Technology and Research, 8 Cleantech Loop, 637145, Singapore
Centre for Isotope Research, University of Groningen, Nijenborgh 6, 9747 AG Groningen, The Netherlands
Stable Isotope Laboratory, Max Planck Institute for Biogeochemistry, Hans-Knoell-St. 10, 07745, Jena, Germany
US Geological Survey, Reston, VA 20192, USA

The last part is always a country. Other parts are unclear. Do we have a list of the organizations' names to extract them from the affiliation strings?
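To make the difficulty concrete, here is a minimal Python sketch of that comma-splitting approach. It assumes the last comma-separated part is the country; the function name `split_affiliation` is hypothetical and not part of any Relaton gem.

```python
def split_affiliation(text):
    """Split an affiliation string on commas.

    Assumption: the last part is the country. Nothing here can tell
    which of the remaining parts are organization names and which
    are address components.
    """
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if len(parts) < 2:
        # A single part cannot be split into "rest" and "country".
        return {"rest": parts, "country": None}
    return {"rest": parts[:-1], "country": parts[-1]}


result = split_affiliation("US Geological Survey, Reston, VA 20192, USA")
print(result)
```

The `rest` list still mixes the organization name with the city and postal code, which is exactly the part that stays unclear.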

ronaldtse commented 1 year ago

@andrew2net I don't think we can reliably differentiate organization names from addresses given this content.

The <aff> element is just an unstructured field. However, about 1,000 files contain tagged content inside <aff>:

<aff id="affiliation06"><label>6</label>
<institution xlink:type="simple">National Institute of Standards and Technology (NIST)</institution>, 100 Bureau Drive, Gaithersburg, MD, 20899, <country>United States of America</country>

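Here it includes a <label>, an <institution> (the organization name) and a <country>; the street address between them is untagged text.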

I have extracted all the <aff> elements; 6219 of the 9480 entries are unique.

The last part is always a country

This is not always the case. Some entries do not have a country, e.g.:

Univ. Reading
Univ. Sao Paulo
VNIIOFI
Working Group 2 of the Comité Consultatif de Thermométrie
Working Group III of the Consultative Committee for Thermometry
Working Group on Unstabilized Lasers of the Consultative Committee for Length
Warsaw Univ. Technol.
Wroclaw Tec. Univ.

Some of these are mis-encoded: sometimes multiple affiliations are put under the same label, e.g.:

Univ. Alabama-Huntsville, FTI, NIST
Univ. Arizona, South Dakota State Univ., JET Propulsion Lab., Saga Univ., GSJ
Univ. Colorado, Washington State Univ., Univ. Paris XI
Univ. Florida, NIST
Univ. Hannover, PTB
Univ. Helsinki, MRI-HUT, CMA, SP, DFM, BIPM
Univ. Paris XII, BNM-LNE
Univ. Reading, NIST, SIS
Univ. Zaragoza, Univ. Florence, IMGC
VNIIFTRI (National Research Institute Physicotechnical and Radio Engineering Measurements), Russian Federation
VNIIFTRI, PTB
VNIIM (All-Russia D I Mendeleev Scientific and Research Institute for Metrology), Russia
VNIIM, NCM, NMS, NILPRP, BIPM
VNIIM, PTB, SMU

Some are broken:

National Institute of Standards and TechnologyNIST is part of the US Department of Commerce., 100 Bureau Drive, Gaithersburg, MD 20899, USA

Some are alternative forms of the same institution:

<institution xlink:type="simple">INRIM—Istituto Nazionale di Ricerca Metrologica</institution>, Str. delle Cacce 91, 10135 Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM—Istituto Nazionale di Ricerca Metrologica</institution>, Strada delle Cacce 91, 10135 Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM—Istituto Nazionale di Ricerca Metrologica</institution>, Strada delle Cacce, 91, 10135 Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM</institution>, Strada delle Cacce 91, 10135 Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM</institution>, Strada delle Cacce 91, I-10135 Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM</institution>, Torino, <country>Italy</country>
<institution xlink:type="simple">INRIM</institution>, Turin, <country>Italy</country>

@andrew2net @opoudjis do we have a way of representing an affiliation with a "name with address" without separating them?

opoudjis commented 1 year ago

do we have a way of representing an affiliation with a "name with address" without separating them?

No.

ronaldtse commented 1 year ago

In this case let me suggest the following.

For aff elements that contain tags

Assign these fields:

For aff elements that have no tags

All content: the name of the organization.
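A minimal Python sketch of this two-way split, using the standard library's ElementTree. The output keys `name` and `country` and the helper names are assumptions for illustration, not the actual Relaton field names. The sample fragments declare the `xlink` prefix on `<aff>` only so they parse standalone; in a real JATS file the declaration normally sits on an ancestor element.

```python
import xml.etree.ElementTree as ET


def flatten_text(aff):
    """Join the text of an <aff> element, skipping its <label> child."""
    pieces = [aff.text or ""]
    for child in aff:
        if child.tag != "label":
            pieces.append("".join(child.itertext()))
        # The tail is the text following the child, so it is always kept.
        pieces.append(child.tail or "")
    return " ".join("".join(pieces).split())


def element_text(element):
    """Return the stripped text of an element, or None if it is absent."""
    if element is None:
        return None
    return "".join(element.itertext()).strip()


def parse_aff(aff_xml):
    """Map one <aff> fragment to a dict (hypothetical keys)."""
    aff = ET.fromstring(aff_xml)
    institution = aff.find("institution")
    country = aff.find("country")
    if institution is None and country is None:
        # No tags: all content is taken as the organization name.
        return {"name": flatten_text(aff), "country": None}
    return {"name": element_text(institution), "country": element_text(country)}


tagged = (
    '<aff xmlns:xlink="http://www.w3.org/1999/xlink" id="affiliation06">'
    "<label>6</label>\n"
    '<institution xlink:type="simple">National Institute of Standards and '
    "Technology (NIST)</institution>, 100 Bureau Drive, Gaithersburg, MD, "
    "20899, <country>United States of America</country></aff>"
)
untagged = '<aff id="aff1"><label>1</label>Univ. Reading</aff>'

print(parse_aff(tagged))
print(parse_aff(untagged))
```

The untagged address text in a tagged `<aff>` is simply dropped in this sketch; whether and where to keep it is left open.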

Some data cleansing is necessary

Wrong institution tags (ignore for now)

e.g.

<institution xlink:type="simple">1005 Southover Lane</institution>, Victoria, BC, V8Y 3C3, <country>Canada</country>

Content stripping needed (can do)

Remove "Permanent address: ".

Permanent address: <institution xlink:type="simple">Główny Urząd Miar</institution>, <country>Poland</country>
Permanent address: 36 Zunuqua Trail, PO Box 307, Orcas, WA 98280-0307, USA.
Permanent address: Institute of Laser Physics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia.
Permanent address: Institute of Scientific Instruments, Brno, Czech Republic.
Permanent address: Measurement Standards Laboratory, PO Box 31310, Lower Hutt, New Zealand.
Permanent address: Physics Department, East China Normal University, Shanghai 200062, People's Republic of China.
Permanent address: Tsinghua University, Department of Precision Instrument and Mechanology, Beijing 100084, People's Republic of China.
Permanent address: Tsinghua University, Department of Precision Instrument and Mechanology, Beijing 100084, People's Republic of China.
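The stripping step could look like the following Python sketch. It works on the plain-text content of an affiliation; the function name `strip_permanent_address` is hypothetical, and matching the prefix case-insensitively is an assumption.

```python
import re

# Leading "Permanent address:" marker, with optional surrounding whitespace.
PERMANENT_ADDRESS_PREFIX = re.compile(r"^\s*Permanent address:\s*", re.IGNORECASE)


def strip_permanent_address(text):
    """Remove a leading "Permanent address: " marker, if present."""
    return PERMANENT_ADDRESS_PREFIX.sub("", text, count=1)


cleaned = strip_permanent_address(
    "Permanent address: Institute of Scientific Instruments, Brno, Czech Republic."
)
print(cleaned)
```

Strings without the prefix are returned unchanged.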

Non-affiliation content (ignore for now)

Author to whom any correspondence should be addressed.
Author to whom any correspondence should be addressed. Address for correspondence: Center for Bioanalysis, Korea Research Institute of Standards and Science, 1 Doryong-dong, Yusung-gu, Daejeon 305-401, Korea.
Authors to whom any correspondence should be addressed.

Missing affiliation (ignore for now)

Germany
Guest Editors

Affiliations with qualifications (ignore for now)

Currently a Guest Researcher at NIST. Permanent address: Measurement Standards Laboratory, PO Box 31310, Lower Hutt 5040, New Zealand.
Deceased, formerly at: Department of Electricity, Radiation and Length, Van Swinden Laboratorium, Thijsseweg 11, 2629 JA Delft, the Netherlands
Guest researcher at NIST.
Guest researcher at PTB
Guest scientist at NMIJ.
Guest Scientist at the National Institute of Standards and Technology
Guest scientist, formerly with the National Physical Laboratory, Teddington, Middlesex TW11 0LW, UK
Guest scientist, Hampton, Middlesex TW12 2TY, UK
Guest scientist, Hampton, Middlesex TW12 2TY, UK

"Guest researcher" or "Guest scientist" should be considered the qualification of the affiliation.
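A rough Python sketch of how that split might be detected, assuming the qualification always comes first and is followed by "at", "at the" or a comma. The function name `split_qualification` and the tuple it returns are hypothetical.

```python
import re

# "Guest researcher" / "Guest scientist", then an optional "at", "at the" or ",".
QUALIFICATION = re.compile(
    r"^(Guest (?:researcher|scientist))(?:\s+at(?:\s+the)?\b|,)?\s*(.*)$",
    re.IGNORECASE,
)


def split_qualification(text):
    """Return (qualification, affiliation); qualification is None if not found."""
    match = QUALIFICATION.match(text.strip())
    if match is None:
        return (None, text.strip())
    return (match.group(1), match.group(2).strip() or None)


print(split_qualification("Guest researcher at PTB"))
print(split_qualification("Guest scientist, Hampton, Middlesex TW12 2TY, UK"))
```

Entries such as "Guest scientist, formerly with the National Physical Laboratory, ..." or "Currently a Guest Researcher at NIST. ..." are not handled cleanly by this pattern and would still need manual review.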