Open michamos opened 7 years ago
Here's a simple run of afftranslator on that paper:
In [2]: import afftranslator2
In [3]: afftranslator2.bestmatch("""Institute for Fundamental Theory, Department of Physics, University of Florida,
...: Gainesville, Florida 32611, USA""", 'ICN')
--------------------------
aff = Institute for Fundamental Theory, Department of Physics, University of Florida,
Gainesville, Florida 32611, USA
naff = Institute for Fundamental Theory , Department of Physics , University of Florida ,
Gainesville , Florida 32611 , USA
naff1 = _INS Fundamental _THE _DEP _PHY _UNI Florida
Gainesville Florida 32611 USA Institute for Fundamental Theory , Department of Physics , University of Florida ,
32611
saff = IFUNDAMENTALTDPUFLORIDA
GAINESVILLEFLORIDA32611USAINSTITUTEFORFUNDAMENTALTHEORYDEPARTMENTOFPHYSICSUNIVERSITYOFFLORIDA
32611
country == US
type == u
rel. word == [u'_UNI', u'_PHY', u'_DEP', u'_INS', u'_THE', u'University', u'Fundamental', u'Physics', u'Florida', u'Department', u'Institute', u'for', u'32611']
Out[3]:
[(28.88496182942599, 'Florida U., Inst. Fund. Theor.', 61.03298176010214),
(28.045347081571062, 'Florida U.', 61.03298176010214),
(6.608779256795568, 'Stanford U., Phys. Dept.', 61.03298176010214),
(6.150175512216395, 'Harvard U., Phys. Dept.', 61.03298176010214),
(5.019476556012691, 'Minnesota U., Theor. Phys. Inst.', 61.03298176010214),
(2.5471385711849126, 'Cornell U., Phys. Dept.', 61.03298176010214),
(1.82707190358193, 'Stanford U., Appl. Phys. Dept.', 61.03298176010214),
(-0.10405603206923897, 'Ohio U., Inst. Nucl. Part. Phys.', 61.03298176010214),
(-1.029932390256058, 'Stanford U., Inst. Plasma Physics', 61.03298176010214),
(-2.3727775909587088,
'U. Louisiana, Lafayette, Dept. Phys.',
61.03298176010214),
(-7.339985149545358, 'U. Texas, El Paso, Dept. Phys.', 61.03298176010214)]
Here is a benchmark that can be used to evaluate afftranslator and any future better implementations. https://gist.github.com/kaplun/41d6a26114f81e1d184bba75ad2403f9
So the script guessed between 74 and 77% of the ICNs from the raw affiliation, assuming the benchmark to be correct.
Now @annetteholtkamp points out that the current mapping could actually be wrong. So we need an estimate of how wrong it is.
> https://gist.github.com/kaplun/41d6a26114f81e1d184bba75ad2403f9
I'm not sure if I understand the format: keys are normalized ICNs (the target), values are raw affiliation strings (the source)?
> So the script guessed between 74 and 77% of the ICNs from the raw affiliation, assuming the benchmark to be correct.
Just so we compare apples to apples, what counts as a success? Returning the right ICN at the top position, or just among the results?
> I'm not sure if I understand the format: keys are normalized ICNs (the target), values are raw affiliation strings (the source)?
Indeed, but we saw mistakes in there. So @annetteholtkamp and I will curate a list of 1000 random affiliation strings to be sure there are no mistakes in the benchmark data.
> Just so we compare apples to apples, what counts as a success? Returning the right ICN at the top position, or just among the results?
Top position.
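For concreteness, here is a minimal sketch of that "top position" criterion. It assumes the benchmark is a list of `(ICN, raw_affiliation)` pairs and that a matcher returns candidate ICNs ranked best first; `top1_accuracy` and `toy_matcher` are hypothetical names used only for illustration, and the second benchmark pair is made up to show a miss.

```python
def top1_accuracy(benchmark, matcher):
    """Fraction of pairs whose expected ICN is ranked first by `matcher`."""
    if not benchmark:
        return 0.0
    hits = 0
    for expected_icn, raw_affiliation in benchmark:
        candidates = matcher(raw_affiliation)  # assumed ranked, best first
        # Only the top candidate counts; a correct ICN further down is a miss.
        if candidates and candidates[0] == expected_icn:
            hits += 1
    return float(hits) / len(benchmark)


def toy_matcher(raw_affiliation):
    """Hypothetical stand-in for afftranslator2 or an ES query."""
    if "Florida" in raw_affiliation:
        return ["Florida U., Inst. Fund. Theor.", "Florida U."]
    return []


benchmark = [
    # Pair taken from the example in this issue.
    ("Florida U., Inst. Fund. Theor.",
     "Institute for Fundamental Theory, Department of Physics, "
     "University of Florida, Gainesville, Florida 32611, USA"),
    # Made-up pair: the right ICN is returned, but only in second position.
    ("Florida U.", "University of Florida, Gainesville"),
]

score = top1_accuracy(benchmark, toy_matcher)
print(score)
```

With this definition the second pair counts as a failure even though the expected ICN is among the results, which is the stricter of the two options discussed.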
I was replying, but @michamos already answered everything :+1:
I did the random selection of affiliations, will attach the curated list once we have it.
Higher probability of cleaner data. https://gist.github.com/kaplun/cbda8713656bf01ebfbc045dd8aa0c6d
https://gist.github.com/michamos/bbac4b1ff563b2263a2276f8c601ffa4 contains two JSON lists of `[ICN, raw_affiliation]` pairs (each element being a unicode Python string).
We have 102 errors out of 1200 uncurated mappings, i.e. an error rate of 8.5% ± 2.9% (roughly) :sob:, so if you can get similar accuracy without human intervention, that would free up lots of cataloguer time.
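As a quick check of the arithmetic, the sketch below recomputes the error rate from the two counts. The thread does not say how the ± figure was obtained; it matches a rough 1/sqrt(N) estimate, which is an assumption on my part. The binomial standard error is shown next to it for comparison.

```python
import math

errors, total = 102, 1200

rate = float(errors) / total            # observed error rate
rough = 1 / math.sqrt(total)            # rough 1/sqrt(N) uncertainty (assumed origin of the quoted figure)
binomial_se = math.sqrt(rate * (1 - rate) / total)  # binomial standard error

print("error rate:        %.1f%%" % (100 * rate))
print("1/sqrt(N):         %.1f%%" % (100 * rough))
print("binomial std err:  %.1f%%" % (100 * binomial_se))
```

Either way the conclusion is the same: the human-assigned mappings are wrong in somewhat under one case in ten.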
I can report that the "30 seconds implementation" I showed you yesterday has 72% precision on the curated dataset, which looks comparable to what @kaplun reported for `afftranslator2`, but requires no extra work. For reference:
{
"_source": [
"ICN"
],
"query": {
"match": {
"_all": ...
}
}
}
I'm confident that with a tiny bit of tweaking we can ship something that beats `afftranslator2` + human.
Note that I ran afftranslator on the potentially mismatched list. I will also try it on @michamos's curated list to get the final baseline.
> we have 102 errors out of 1200 uncurated mappings
Actually, these were supposedly curated mappings.
> actually these were supposedly curated mappings.
by uncurated, I mean before @annetteholtkamp and I looked at the list. But you are right that they have in principle been curated by our cataloguers, which makes it rather clear that this is a hard task for humans (the way it works now, just picking an ICN from the autocomplete list in the record editor).
And we have the results: :drum:
Only 958 of the 1200 lines are unique, e.g. "CERN --- CERN, Geneva, Switzerland" appears 75 times.
-- Florian Schwennsen Deutsches Elektronen-Synchrotron DESY Building 01 Room O1.446 phone: +49-40-8998-6190
From: "Samuele Kaplun" notifications@github.com To: "inspirehep/inspire-next" inspire-next@noreply.github.com Cc: "Florian Schwennsen" florian.schwennsen@desy.de, "Mention" mention@noreply.github.com Sent: Thursday, 19 January, 2017 13:45:57 Subject: Re: [inspirehep/inspire-next] Deduce institution from raw affiliation (#1875)
> Higher probability of cleaner data. https://gist.github.com/kaplun/cbda8713656bf01ebfbc045dd8aa0c6d
Mmh. I think they should differ by at least one space. Entries have been genuinely taken from real records, and then put in a `set` to remove duplicates.
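To illustrate how both observations can hold at once, here is a small sketch: a Python `set` only removes exact duplicates, so strings that differ by a single space all survive. The whitespace variants below are made up for illustration, and `normalize` is a hypothetical helper, not something taken from the benchmark scripts.

```python
import re


def normalize(line):
    """Collapse runs of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", line).strip()


lines = [
    "CERN --- CERN, Geneva, Switzerland",
    "CERN --- CERN,  Geneva, Switzerland",   # illustrative: doubled space
    "CERN --- CERN, Geneva, Switzerland ",   # illustrative: trailing space
]

exact_unique = set(lines)                               # exact-match dedup
normalized_unique = {normalize(line) for line in lines}  # dedup after cleanup

print(len(exact_unique), len(normalized_unique))
```

Counting unique lines after such a normalization would be one way to reconcile the 958 figure with the deduplicated 1200.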
I would give the benchmark itself a 94%.
* I would say "OMEGA" is simply not assignable without further investigation.
* I have the impression that "JAERI, Tokai" and "JAEA, Ibaraki" are duplicates.
* I noted the following mistakes in benchmark.txt. Correct ICN in curly brackets:
{Airbus, Immenstaad} Astrium, Immenstaad --- Airbus Defence and Space - Claude-Dornier-Strasse - 88090 Immenstaad - Germany
{U. Bern, AEC} Bern U. --- Albert Einstein Center for Fundamental Physics - ITP, University of Bern, Switzerland
{Unlisted, DE} BESSY, Berlin --- EvoLogics GmbH, Berlin, Germany
{BNL, C-A Dept.} BNL, NSLS --- Brookhaven National Laboratory, 911 B, Upton, NY 11973, USA
{U. Bologna (main)} Bologna U. --- U. of Bologna
{Brookhaven Natl. Lab.} Brookhaven --- BNL, Upton, Long Island, New York, USA
{Brookhaven Natl. Lab.} Brookhaven --- BNL, Upton, Long Island, New York, USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory , Upton, New York 11973 USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory , Upton, New York 11973, USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory - Upton - NY 11973 - USA
{Brookhaven Natl. Lab.} Brookhaven --- Brookhaven National Laboratory, Upton, NY, USA
{Pitesti, Inst. Nucl. Power Reactors} Bucharest, IFIN-HH --- Institute for Nuclear Research - Piteşti - Romania - Campului Str - No. 1 - POB 78 - 115400 - Mioveni - Arges County - Romania
{Wigner RCP, Budapest} Budapest, RMKI --- Institute for Nuclear Research of the Hungarian Academy of Sciences
{Wigner RCP, Budapest} Budapest, RMKI --- Institute for Nuclear Research of the Hungarian Academy of Sciences
{Wigner RCP, Budapest} Budapest, RMKI --- Wigner RCP, RMKI, H-1121 Budapest, Konkoly Thege Miklós út 29-33, Hungary
{DESY; DESY, Zeuthe} DESY, Zeuthen --- DESY - Hamburg and Zeuthen - Germany
{DESY; DESY, Zeuthe} DESY, Zeuthen --- DESY, Hamburg and Zeuthen, Germany
{DESY} DESY, Zeuthen --- Deutsches Elektronen-Synchrotron (Germany)
{TU, Dresden (main)} Dresden, Tech. U. --- Technische Universität Dresden, 01062, Dresden, Germany
{TU, Dresden (main)} Dresden, Tech. U. --- Technische Universität Dresden, 01062, Dresden, Germany
{TU, Dresden (main)} Dresden, Tech. U. --- TU Dresden
{U. Erlangen-Nuremberg (main)} Erlangen - Nuremberg U. --- University of Erlangen-Nuernberg
{U. Erlangen-Nuremberg (main)} Erlangen - Nuremberg U. --- University of Erlangen-Nürnberg
{HZDR, Dresden} Forschungszentrum Dresden Rossendorf --- Helmholtz-Zentrum Dresden-Rossendorf - D-01328 Dresden - Germany
{HZDR, Dresden} Forschungszentrum Dresden Rossendorf --- Institut für Strahlenphysik, Helmholtz-Zentrum Dresden-Rossendorf, 01314 Dresden, Germany
{U. Geneva (main)} Geneva U. --- Univ. de Genéve (Switzerland)
{U. Geneva (main)} Geneva U. --- Univ. de Genève (Switzerland)
{U. Geneva (main)} Geneva U. --- Univ. de Genève (Switzerland)
{U. Giessen (main)} Giessen U. --- Justus-Liebig University, Giessen, Germany
{U. Hamburg (main)} Hamburg U. --- Hamburg U.
{U. Hamburg (main)} Hamburg U. --- Hamburg University
{U. Hamburg (main)} Hamburg U. --- University of Hamburg
{DESY} Hasylab, DESY --- Deutsches Elektronen-Synchrotron
{Calcutta, VECC} HBNI, Mumbai --- Theoretical High Energy Physics Division , Variable Energy Cyclotron Centre, HBNI, 1/AF Bidhannagar Kolkata - 700064, India
{Milan U.; INFN, Milan} INFN, Milan --- Dipartimento di Fisica - Università degli Studi e INFN - Milano 20133 - Italy
{INFN, Italy} INFN, Turin --- INFN
{INFN, Italy} INFN, Turin --- INFN
{IRFU, Saclay} IRFU, SPhN, Saclay --- Commissariat à l’Énergie Atomique et aux Énergies Alternatives - Centre de Saclay - IRFU - 91191 Gif-sur-Yvette - France
{IRFU, Saclay} IRFU, SPhN, Saclay --- IRFU, Saclay
{IRFU, Saclay} IRFU, SPhN, Saclay --- IRFU, Saclay
{Jagiellonian U. (main)} Jagiellonian U. --- Jagiellonian University, 30059, Krakow, Poland
{Jagiellonian U. (main)} Jagiellonian U. --- Jagiellonian University, Krakow, Poland
{KIT, Karlsruhe} Karlsruhe, Forschungszentrum --- Karlsruhe Institute of Technology
{KIT, Karlsruhe, IKP} Karlsruhe, Forschungszentrum --- Karlsruher Institut für Technologie, Institut für Kernphysik, Postfach 3640, 76021 Karlsruhe, Germany
{Unlisted} Karlsruhe U. --- Institute for Nuclear Physics
{KIT, Karlsruhe} Karlsruhe U. --- Karlsruhe Institute of Technology
{KIT, Karlsruhe, TTP} Karlsruhe U., TTP --- Institut für Theoretische Physik - Karlsruher Institut für Technologie - 76128 - Karlsruhe - Germany
Karlsruhe U. --- Universität Karlsruhe, Karlsruhe, Germany
{KAERI, Taejon} KASI, DaeJeon --- Neutron Science Division - Korea Atomic Energy Research Institute - Daejeon 305-353 - Korea
{KIT, Karlsruhe} KIT, Karlsruhe, IPE --- KIT, Eggenstein-Leopoldshafen, Germany
{Frascati} LNF, Dafne Light --- INFN Laboratori Nazionali di Frascati, Frascati, Italy
{FRascati} LNF, Dafne Light --- INFN Laboratori Nazionali di Frascati - Frascati (RM) - Italy
{NCBJ, Lodz} Lodz, IPJ --- National Centre for Nuclear Research, Department of Astrophysics, Lodz, Poland
{Mainz U., Inst. Phys.; U. MAINZ, PRISMA} Mainz U., Inst. Phys. --- Institute of Physics and Excellence Cluster PRISMA - Johannes Gutenberg-Universität Mainz - 55099 Mainz - Germany
{Glasgow U.} Manchester U. --- University of Glasgow
{MIT, CTP} MIT --- MIT CTP
{Munich, Tech. U; Munich, Tech. U., Universe} Munich, Tech. U. --- Physik-Department and Excellence Cluster Universe - Technische Universität München - 85747 - Garching - Germany
{Unlisted, FR} Orsay, LURE --- The Sciences ACO association, France
{Paraiba U.} Paraiba State U. --- Departamento de Física, Universidade Federal da Paraíba, Caixa Postal 5008, João Pessoa-PB, 58051-900, Brazil
{U. Regensburg (main)} Regensburg U. --- University of Regensburg, Germany
{Rio de Janeiro, CBPF} Rio de Janeiro Observ. --- Centro Brasileiro de Pesquisas Físicas—CBPF/MCTI , 22290-180 Rio de Janeiro, Brazil
{Unlisted, FR} Tours U., CNRS --- CNRS
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Witwatersrand, Johannesburg, Sch. Phys.} Witwatersrand U. --- School of Physics - U. of the Witwatersrand - Johannesburg 2050 - South Africa
{U. Wurzburg (main)} Wurzburg U. --- Universität Würzburg
{ETH, Zurich (main)} Zurich, ETH --- ETH Zurich
{ETH, Zurich (main)} Zurich, ETH --- ETH Zürich (Switzerland)
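Each correction above has the shape `{correct ICN} benchmark ICN --- raw affiliation`. In case it helps to apply them to the benchmark programmatically, here is a minimal parsing sketch. `ENTRY_RE` and `parse_correction` are hypothetical names, and reading `;` inside the braces as a separator between several correct ICNs is my assumption.

```python
import re

# {correct ICN(s)} benchmark ICN --- raw affiliation
ENTRY_RE = re.compile(
    r"^\{(?P<correct>[^}]*)\}\s*(?P<benchmark>.*?)\s+---\s+(?P<raw>.*)$"
)


def parse_correction(entry):
    """Return (correct_icns, benchmark_icn, raw_affiliation), or None.

    None is returned for entries that carry no {...} correction.
    """
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        return None
    # Assumption: ';' separates several correct ICNs for one affiliation.
    correct = [icn.strip() for icn in match.group("correct").split(";")]
    return correct, match.group("benchmark"), match.group("raw")


single = parse_correction("{Glasgow U.} Manchester U. --- University of Glasgow")
double = parse_correction(
    "{DESY; DESY, Zeuthe} DESY, Zeuthen --- DESY, Hamburg and Zeuthen, Germany"
)
missing = parse_correction("Karlsruhe U. --- Universität Karlsruhe, Karlsruhe, Germany")

print(single)
print(double)
print(missing)
```

The single hyphens used as separators inside some raw affiliations are left alone, because the pattern only splits on the first triple dash surrounded by whitespace.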
On Friday, 20 January 2017, 09:52:05, Samuele Kaplun wrote:
> And we have the results: 🥁
> * 83% recognized successfully by afftranslator in the curated list.
> - In 44% of the cases, afftranslator recognized the same wrong ICN as in the bad list (i.e. afftranslator plus a human not checking is responsible for 44% of the wrong ICNs).
(Following conversation with @kaplun and @jacquerie to formalize the requirements.)
An important part of the work of cataloguers is adding affiliation identifiers (`ICN`s) for authors based on the affiliations written on the paper. It would be extremely worthwhile to automate this process.
There are two required steps:
1. Extract the affiliation string from the PDF (or possibly LaTeX for an arXiv paper) to populate the `raw_affiliation`. One might use GROBID for this step.
2. Normalize the `raw_affiliation` into the `ICN`. The state of the art is @fschwenn's `afftranslator2` (see #1873). @jacquerie suggests using ES instead.
Example: `raw_affiliation` is "Institute for Fundamental Theory, Department of Physics, University of Florida, Gainesville, Florida 32611, USA". The `ICN` is `Florida U., Inst. Fund. Theor.`, corresponding to record https://inspirehep.net/record/908013.
I can give more complex examples (several authors, each having several affiliations) if needed.