inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Deduce institution from raw affiliation #1875

Open michamos opened 7 years ago

michamos commented 7 years ago

(Following conversation with @kaplun and @jacquerie to formalize the requirements.)

An important part of the work of cataloguers is adding affiliation identifiers (ICNs) for authors based on the affiliations written on the paper. It would be extremely worthwhile to automate this process.

There are two required steps:

  1. Extract the affiliation string from the PDF (or possibly LaTeX for an arXiv paper) to populate the raw_affiliation. One might use GROBID for this step.

  2. Normalize the raw_affiliation into the ICN. The state of the art is @fschwenn's afftranslator2 (see #1873). @jacquerie suggests using Elasticsearch (ES) instead.

Example:

I can give more complex examples (several authors, each having several affiliations) if needed.

kaplun commented 7 years ago

Here's a simple run of afftranslator on that paper:

In [2]: import afftranslator2

In [3]: afftranslator2.bestmatch("""Institute for Fundamental Theory, Department of Physics, University of Florida,
   ...: Gainesville, Florida 32611, USA""", 'ICN')
--------------------------
aff   =  Institute for Fundamental Theory, Department of Physics, University of Florida,
Gainesville, Florida 32611, USA
naff  =  Institute for Fundamental Theory , Department of Physics , University of Florida , 
Gainesville , Florida 32611 , USA
naff1 =  _INS Fundamental _THE _DEP _PHY _UNI Florida 
Gainesville Florida 32611 USA Institute for Fundamental Theory , Department of Physics , University of Florida , 
32611
saff  =  IFUNDAMENTALTDPUFLORIDA
GAINESVILLEFLORIDA32611USAINSTITUTEFORFUNDAMENTALTHEORYDEPARTMENTOFPHYSICSUNIVERSITYOFFLORIDA
32611
country    ==  US
type       ==  u
rel. word  ==  [u'_UNI', u'_PHY', u'_DEP', u'_INS', u'_THE', u'University', u'Fundamental', u'Physics', u'Florida', u'Department', u'Institute', u'for', u'32611']
Out[3]: 
[(28.88496182942599, 'Florida U., Inst. Fund. Theor.', 61.03298176010214),
 (28.045347081571062, 'Florida U.', 61.03298176010214),
 (6.608779256795568, 'Stanford U., Phys. Dept.', 61.03298176010214),
 (6.150175512216395, 'Harvard U., Phys. Dept.', 61.03298176010214),
 (5.019476556012691, 'Minnesota U., Theor. Phys. Inst.', 61.03298176010214),
 (2.5471385711849126, 'Cornell U., Phys. Dept.', 61.03298176010214),
 (1.82707190358193, 'Stanford U., Appl. Phys. Dept.', 61.03298176010214),
 (-0.10405603206923897, 'Ohio U., Inst. Nucl. Part. Phys.', 61.03298176010214),
 (-1.029932390256058, 'Stanford U., Inst. Plasma Physics', 61.03298176010214),
 (-2.3727775909587088,
  'U. Louisiana, Lafayette, Dept. Phys.',
  61.03298176010214),
 (-7.339985149545358, 'U. Texas, El Paso, Dept. Phys.', 61.03298176010214)]
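
As a small illustration of how such a result list could be consumed, here is a sketch that picks the top-ranked ICN and reports its margin over the runner-up. It assumes the first tuple element is a score where higher means a better match; the meaning of the third element is not stated in this thread, so it is ignored. Only the first three entries of the output above are reproduced.

```python
# A few entries copied from the output above: (score, ICN, third value).
results = [
    (28.88496182942599, 'Florida U., Inst. Fund. Theor.', 61.03298176010214),
    (28.045347081571062, 'Florida U.', 61.03298176010214),
    (6.608779256795568, 'Stanford U., Phys. Dept.', 61.03298176010214),
]

# Assumption: the first element is a score, higher meaning a better match.
ranked = sorted(results, key=lambda r: r[0], reverse=True)
best_score, best_icn, _ = ranked[0]
runner_up_score = ranked[1][0]
margin = best_score - runner_up_score

print(best_icn)
print(f"margin over runner-up: {margin:.2f}")
```

In this run the two Florida candidates score very close to each other, so a small margin like this one might be a useful signal for flagging a guess for human review.
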
kaplun commented 7 years ago

Here is a benchmark that can be used to evaluate afftranslator and any better future implementation: https://gist.github.com/kaplun/41d6a26114f81e1d184bba75ad2403f9

kaplun commented 7 years ago

So the script guessed between 74 and 77% of the ICNs from the raw affiliation, assuming the benchmark to be correct.

Now @annetteholtkamp points out that the current mapping could actually be wrong. So we need an estimate of how wrong it is.

jacquerie commented 7 years ago

> https://gist.github.com/kaplun/41d6a26114f81e1d184bba75ad2403f9

I'm not sure if I understand the format: keys are normalized ICNs (the target), values are raw affiliation strings (the source)?

jacquerie commented 7 years ago

> So the script guessed between 74 and 77% of the ICNs from the raw affiliation, assuming the benchmark to be correct.

Just so we compare apples to apples, what counts as a success? Returning the right ICN at the top position, or just among the results?

michamos commented 7 years ago

> I'm not sure if I understand the format: keys are normalized ICNs (the target), values are raw affiliation strings (the source)?

Indeed, but we saw mistakes in there. So @annetteholtkamp and I will curate a list of 1000 random affiliation strings to be sure there are no mistakes in the benchmark data.

> Just so we compare apples to apples, what counts as a success? Returning the right ICN at the top position, or just among the results?

Top position.
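
To make the "top position" criterion concrete, here is a minimal sketch of how a matcher could be scored. It assumes the benchmark is available as (expected ICN, raw affiliation) pairs and that the matcher returns candidate ICNs ordered best-first. The names `top1_accuracy` and `toy_matcher` are hypothetical and are not part of afftranslator2; the second benchmark pair is made up for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def top1_accuracy(
    pairs: Iterable[Tuple[str, str]],
    matcher: Callable[[str], List[str]],
) -> float:
    """Fraction of raw affiliations whose first candidate is the expected ICN."""
    total = 0
    hits = 0
    for expected_icn, raw_affiliation in pairs:
        total += 1
        candidates = matcher(raw_affiliation)
        # Only the top-ranked candidate counts; a correct ICN further
        # down the list is still scored as a miss.
        if candidates and candidates[0] == expected_icn:
            hits += 1
    return hits / total if total else 0.0


def toy_matcher(raw_affiliation: str) -> List[str]:
    """Hypothetical stand-in for a real matcher, for illustration only."""
    if "CERN" in raw_affiliation:
        return ["CERN", "DESY"]
    return ["DESY", "Florida U."]


benchmark = [
    ("CERN", "CERN, Geneva, Switzerland"),
    ("Florida U.", "University of Florida, Gainesville, USA"),
]
print(top1_accuracy(benchmark, toy_matcher))
```

The second pair is scored as a miss even though the right ICN appears in the candidate list, which is exactly the stricter definition agreed on here.
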

kaplun commented 7 years ago

I was replying, but @michamos has already answered everything :+1:

michamos commented 7 years ago

I did the random selection of affiliations, will attach the curated list once we have it.

kaplun commented 7 years ago

Higher probability of cleaner data. https://gist.github.com/kaplun/cbda8713656bf01ebfbc045dd8aa0c6d

michamos commented 7 years ago

https://gist.github.com/michamos/bbac4b1ff563b2263a2276f8c601ffa4 contains two JSON lists:

we have 102 errors out of 1200 uncurated mappings, i.e. an error rate of 8.5% ± 2.9% (roughly) :sob:, so if you can get similar accuracy without human intervention that would free lots of cataloger time.
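
The arithmetic behind these figures can be checked in a few lines. This is only a sketch: the ± 2.9% happens to match 1/sqrt(1200), which is a guess at how the rough uncertainty was obtained. A binomial standard error, which assumes the mappings are independent, is shown alongside for comparison.

```python
import math

errors = 102
total = 1200

rate = errors / total

# Assumption: treat each mapping as an independent Bernoulli trial.
binomial_se = math.sqrt(rate * (1 - rate) / total)

# A cruder rule of thumb that ignores the observed rate.
rough_bound = 1 / math.sqrt(total)

print(f"error rate: {rate:.1%}")
print(f"binomial standard error: {binomial_se:.1%}")
print(f"1/sqrt(N): {rough_bound:.1%}")
```
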

jacquerie commented 7 years ago

I can report that the "30 seconds implementation" I showed you yesterday has 72% precision on the curated dataset, which looks comparable to what @kaplun reported for afftranslator2, but requires no extra work. For reference:

{
    "_source": [
        "ICN"
    ],
    "query": {
        "match": {
            "_all": ...
        }
    }
}

I'm confident that with a tiny bit of tweaking we can ship something that beats afftranslator2 + human.
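
For anyone who wants to reproduce this, here is a minimal sketch of building that request body in Python. It assumes the elided value in the query above is simply the raw affiliation string; `build_affiliation_query` is a hypothetical helper name, and sending the body to an index is left out because the index name and client setup are not given in this thread.

```python
import json


def build_affiliation_query(raw_affiliation: str) -> dict:
    """Build a match query on `_all` that returns only the ICN field."""
    return {
        "_source": ["ICN"],
        "query": {
            "match": {
                # Assumption: the elided value in the comment above is the
                # raw affiliation string itself.
                "_all": raw_affiliation,
            }
        },
    }


body = build_affiliation_query("University of Florida, Gainesville, USA")
print(json.dumps(body, indent=4))
```

Note that the `_all` field exists only in older Elasticsearch versions, so this shape would need adapting on a recent cluster.
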

kaplun commented 7 years ago

Note that I ran afftranslator on the potentially mismatched list. I will also try it on @michamos's curated list to get the final baseline.

kaplun commented 7 years ago

> we have 102 errors out of 1200 uncurated mappings

actually these were supposedly curated mappings.

michamos commented 7 years ago

> actually these were supposedly curated mappings.

by uncurated, I mean before @annetteholtkamp and I looked at the list. But you are right that they have in principle been curated by our cataloguers, which makes it rather clear that this is a hard task for humans (the way it works now, just picking an ICN from the autocomplete list in the record editor).

kaplun commented 7 years ago

And we have the results: :drum:

fschwenn commented 7 years ago

only 958 of the 1200 lines are unique, e.g. "CERN --- CERN, Geneva, Switzerland" appears 75 times

On Thursday, 19 January 2017 at 13:45:57, Samuele Kaplun wrote:

> Higher probability of cleaner data. https://gist.github.com/kaplun/cbda8713656bf01ebfbc045dd8aa0c6d

kaplun commented 7 years ago

Mmh. I think they should differ by at least one space. Entries have been genuinely taken from real records, and then put in a set to remove duplicates.
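
The two counts can be reconciled if lines that differ only in whitespace are treated as the same entry. A small sketch of that check follows; the `normalize` helper is hypothetical, and the sample lines are made up for illustration apart from the CERN string quoted above.

```python
from collections import Counter


def normalize(line: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(line.split())


lines = [
    "CERN --- CERN, Geneva, Switzerland",
    "CERN --- CERN,  Geneva, Switzerland",   # extra inner space
    "CERN --- CERN, Geneva, Switzerland ",   # trailing space
    "DESY --- DESY, Hamburg, Germany",       # illustrative entry
]

# A plain set keeps every whitespace variant as a separate entry.
exact_unique = set(lines)

# Counting normalized lines shows how many variants collapse together.
normalized_counts = Counter(normalize(line) for line in lines)

print(len(exact_unique))
print(len(normalized_counts))
print(normalized_counts.most_common(1))
```
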

fschwenn commented 7 years ago

I would give the benchmark itself a 94%.

* I would say "OMEGA" simply cannot be assigned without more investigation.
* I have the impression that "JAERI, Tokai" and "JAEA, Ibaraki" are duplicates.
* I noted the following mistakes in benchmark.txt. The correct ICN is given in curly brackets:

{Airbus, Immenstaad} Astrium, Immenstaad --- Airbus Defence and Space - Claude-Dornier-Strasse - 88090 Immenstaad - Germany
{U. Bern, AEC} Bern U. --- Albert Einstein Center for Fundamental Physics - ITP, University of Bern, Switzerland
{Unlisted, DE} BESSY, Berlin --- EvoLogics GmbH, Berlin, Germany
{BNL, C-A Dept.} BNL, NSLS --- Brookhaven National Laboratory, 911 B, Upton, NY 11973, USA
{U. Bologna (main)} Bologna U. --- U. of Bologna
{Brookhaven Natl. Lab.} Brookhaven --- BNL, Upton, Long Island, New York, USA
{Brookhaven Natl. Lab.} Brookhaven --- BNL, Upton, Long Island, New York, USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory , Upton, New York 11973 USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory , Upton, New York 11973, USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory - Upton - NY 11973 - USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory, Upton, NY, USA
{Pitesti, Inst. Nucl. Power Reactors} Bucharest, IFIN-HH --- Institute for Nuclear Research - Piteşti - Romania - Campului Str - No. 1 - POB 78 - 115400 - Mioveni - Arges County - Romania
{Wigner RCP, Budapest} Budapest, RMKI --- Institute for Nuclear Research of the Hungarian Academy of Sciences
{Wigner RCP, Budapest} Budapest, RMKI --- Institute for Nuclear Research of the Hungarian Academy of Sciences
{Wigner RCP, Budapest} Budapest, RMKI --- Wigner RCP, RMKI, H-1121 Budapest, Konkoly Thege Miklós út 29-33, Hungary
{DESY; DESY, Zeuthe} DESY, Zeuthen --- DESY - Hamburg and Zeuthen - Germany
{DESY; DESY, Zeuthe} DESY, Zeuthen --- DESY, Hamburg and Zeuthen, Germany
{DESY} DESY, Zeuthen --- Deutsches Elektronen-Synchrotron (Germany)
{TU, Dresden (main)} Dresden, Tech. U. --- Technische Universität Dresden, 01062, Dresden, Germany
{TU, Dresden (main)} Dresden, Tech. U. --- Technische Universität Dresden, 01062, Dresden, Germany
{TU, Dresden (main)} Dresden, Tech. U. --- TU Dresden
{U. Erlangen-Nuremberg (main)} Erlangen - Nuremberg U. --- University of Erlangen-Nuernberg
{U. Erlangen-Nuremberg (main)} Erlangen - Nuremberg U. --- University of Erlangen-Nürnberg
{HZDR, Dresden} Forschungszentrum Dresden Rossendorf --- Helmholtz-Zentrum Dresden-Rossendorf - D-01328 Dresden - Germany
{HZDR, Dresden} Forschungszentrum Dresden Rossendorf --- Institut für Strahlenphysik, Helmholtz-Zentrum Dresden-Rossendorf, 01314 Dresden, Germany
{U. Geneva (main)} Geneva U. --- Univ. de Genéve (Switzerland)
{U. Geneva (main)} Geneva U. --- Univ. de Genève (Switzerland)
{U. Geneva (main)} Geneva U. --- Univ. de Genève (Switzerland)
{U. Giessen (main)} Giessen U. --- Justus-Liebig University, Giessen, Germany
{U. Hamburg (main)} Hamburg U. --- Hamburg U.
{U. Hamburg (main)} Hamburg U. --- Hamburg University
{U. Hamburg (main)} Hamburg U. --- University of Hamburg
{DESY} Hasylab, DESY --- Deutsches Elektronen-Synchrotron
{Calcutta, VECC} HBNI, Mumbai --- Theoretical High Energy Physics Division , Variable Energy Cyclotron Centre, HBNI, 1/AF Bidhannagar Kolkata - 700064, India
{Milan U.; INFN, Milan} INFN, Milan --- Dipartimento di Fisica - Università degli Studi e INFN - Milano 20133 - Italy
{INFN, Italy} INFN, Turin --- INFN
{INFN, Italy} INFN, Turin --- INFN
{IRFU, Saclay} IRFU, SPhN, Saclay --- Commissariat à l’Énergie Atomique et aux Énergies Alternatives - Centre de Saclay - IRFU - 91191 Gif-sur-Yvette - France
{IRFU, Saclay} IRFU, SPhN, Saclay --- IRFU, Saclay
{IRFU, Saclay} IRFU, SPhN, Saclay --- IRFU, Saclay
{Jagiellonian U. (main)} Jagiellonian U. --- Jagiellonian University, 30059, Krakow, Poland
{Jagiellonian U. (main)} Jagiellonian U. --- Jagiellonian University, Krakow, Poland
{KIT, Karlsruhe} Karlsruhe, Forschungszentrum --- Karlsruhe Institute of Technology
{KIT, Karlsruhe, IKP} Karlsruhe, Forschungszentrum --- Karlsruher Institut für Technologie, Institut für Kernphysik, Postfach 3640, 76021 Karlsruhe, Germany
{Unlisted} Karlsruhe U. --- Institute for Nuclear Physics
{KIT, Karlsruhe} Karlsruhe U. --- Karlsruhe Institute of Technology
{KIT, Karlsruhe, TTP} Karlsruhe U., TTP --- Institut für Theoretische Physik - Karlsruher Institut für Technologie - 76128 - Karlsruhe - Germany
Karlsruhe U. --- Universität Karlsruhe, Karlsruhe, Germany
{KAERI, Taejon} KASI, DaeJeon --- Neutron Science Division - Korea Atomic Energy Research Institute - Daejeon 305-353 - Korea
{KIT, Karlsruhe} KIT, Karlsruhe, IPE --- KIT, Eggenstein-Leopoldshafen, Germany
{Frascati} LNF, Dafne Light --- INFN Laboratori Nazionali di Frascati, Frascati, Italy
{FRascati} LNF, Dafne Light --- INFN Laboratori Nazionali di Frascati - Frascati (RM) - Italy
{NCBJ, Lodz} Lodz, IPJ --- National Centre for Nuclear Research, Department of Astrophysics, Lodz, Poland
{Mainz U., Inst. Phys.; U. MAINZ, PRISMA} Mainz U., Inst. Phys. --- Institute of Physics and Excellence Cluster PRISMA - Johannes Gutenberg-Universität Mainz - 55099 Mainz - Germany
{Glasgow U.} Manchester U. --- University of Glasgow
{MIT, CTP} MIT --- MIT CTP
{Munich, Tech. U; Munich, Tech. U., Universe} Munich, Tech. U. --- Physik-Department and Excellence Cluster Universe - Technische Universität München - 85747 - Garching - Germany
{Unlisted, FR} Orsay, LURE --- The Sciences ACO association, France
{Paraiba U.} Paraiba State U. --- Departamento de Física, Universidade Federal da Paraíba, Caixa Postal 5008, João Pessoa-PB, 58051-900, Brazil
{U. Regensburg (main)} Regensburg U. --- University of Regensburg, Germany
{Rio de Janeiro, CBPF} Rio de Janeiro Observ. --- Centro Brasileiro de Pesquisas Físicas—CBPF/MCTI , 22290-180 Rio de Janeiro, Brazil
{Unlisted, FR} Tours U., CNRS --- CNRS
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Wurzburg (main)} Wurzburg U. --- Universität Würzburg
{ETH, Zurich (main)} Zurich, ETH --- ETH Zurich
{ETH, Zurich (main)} Zurich, ETH --- ETH Zürich (Switzerland)
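
In case it helps to apply these corrections programmatically, here is a rough sketch of a parser for a single entry in the format used above: an optional correct ICN in curly brackets, then the benchmark ICN, then ` --- `, then the raw affiliation. The `Correction` type and `parse_correction` function are hypothetical names. The sketch assumes one entry per string and that ` --- ` never occurs inside an ICN.

```python
import re
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    correct_icn: Optional[str]   # None when no curly-bracket part is given
    benchmark_icn: str
    raw_affiliation: str


ENTRY_RE = re.compile(
    r"^(?:\{(?P<correct>[^}]*)\}\s*)?"   # optional "{correct ICN}"
    r"(?P<benchmark>.*?)"                # ICN as listed in the benchmark
    r"\s+---\s+"                         # separator
    r"(?P<raw>.*)$"                      # raw affiliation string
)


def parse_correction(entry: str) -> Correction:
    """Split one correction entry into its three parts."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {entry!r}")
    return Correction(
        match.group("correct"),
        match.group("benchmark"),
        match.group("raw"),
    )


print(parse_correction("{Glasgow U.} Manchester U. --- University of Glasgow"))
print(parse_correction("Karlsruhe U. --- Universität Karlsruhe, Karlsruhe, Germany"))
```

Entries whose benchmark ICN contains single dashes, such as "Erlangen - Nuremberg U.", still parse because only a triple dash surrounded by whitespace is treated as the separator.
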

On Friday, 20 January 2017 at 09:52:05, Samuele Kaplun wrote:

> And we have the results: 🥁
>
> * 83% recognized successfully by afftranslator in the curated list.
>   * For 44% of the entries in the bad list, afftranslator returned the same wrong ICN (i.e. afftranslator followed by a human who does not check is responsible for 44% of the wrong ICNs).