kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Author affiliation not associated correctly #309

Open de-code opened 6 years ago

de-code commented 6 years ago

From the same problem manuscript, the author affiliations aren't associated correctly (consolidateHeader=0): https://www.biorxiv.org/content/biorxiv/early/2018/03/26/287888.full.pdf

Authors from the PDF:

Battram T1,2*,   Richmond RC1,2*, Baglietto L3†, Haycock P1,2†, Perduca V4, Bojesen S5,6,7, Gaunt TR1,2,
Hemani G1,2, Guida F 8, Carreras-Torres R8, Hung R9, Amos CI10 , Freeman JR11, Sandanger M12, Nøst 
TH13, Nordestgaard B5,6,7, Teschendorff AE14,15,16, Polidoro S17, Vineis P17,18, Severi G19,20,21,22, Hodge A22,
Giles G21,22, Grankvist K23, Johansson MB24, Johansson M8, Davey Smith G1,2$, Relton CL1,2$ 

Affiliations from the PDF:

1. MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK 
2. Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK 
3. Department of Clinical and Experimental Medicine, University of Pisa, Pisa, Italy 
4. Laboratoire de Mathématiques Appliquées – MAP5 (UMR CNRS 8145), Université Paris Descartes 
5. Department of Clinical Biochemistry, Herlev and Gentofte Hospital, Copenhagen University 
Hospital, Herlev, Denmark 
6. Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark 
7. The Copenhagen City Heart Study, Frederiksberg Hospital, Copenhagen University 
Hospital, Copenhagen, Denmark 
8. Genetic Epidemiology Division, International Agency for Research on Cancer, Lyon, France 
9. Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada 
10. Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, 
USA 
11. Department of Biostatistics and Epidemiology, University of Massachusetts, Massachusetts, USA 
12. Department of Community Medicine, UiT- The Arctic University of Norway, Tromso, Norway 
13. Department of Public Health and Nursing, Norwegian University of Science and Technology 
(NTNU), Trondheim, Norway 
14. Department of Women's Cancer, Institute for Women's Health, University College London, London, 
UK 
15. UCL Cancer Institute, University College London, London, UK 
16. Chinese Academy of Sciences (CAS) Key Laboratory of Computational Biology, CAS–Max Planck 
Gesellschaft (MPG) Partner Institute for Computational Biology, Shanghai, China 
17. Italian Institute for Genomic Medicine, Torino, Italy 
18. Department of Epidemiology and Biostatistics, the School of Public Health, Imperial College 
London, London, UK 
19. Centre de Recherche en Epidémiologie et Santé des Populations – CESP (UMR INSERM 1018), 
Université Paris-Saclay, Université Paris-Sud, Paris, France 
20. Gustave Roussy, Villejuif, France
21. Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia 
22. Centre for Epidemiology and Biostatistics, Melbourne School of Population & Global Health, The 
University of Melbourne, Australia 
23. Department of Biobank Research, Umeå University, Sweden 
24. Department of Radiation Sciences, Umeå University, Sweden 

It gets it right for the first author (including affiliations 1 and 2).

For the second author, it includes affiliations 1, 2 and 3, even though the PDF lists only 1 and 2, the same as for the first author.

Towards the end of the list, it outputs affiliations that are not attached to any author.
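
For comparison, the expected association can be read directly off the numeric markers in the author line. Below is a minimal sketch in Python, assuming the author line is available as plain text with the markers inline (as pasted above). The names `ENTRY` and `expected_affiliations` are made up for illustration and are not part of GROBID, and this is not how GROBID itself resolves markers.

```python
import re

# First six authors, copied from the author line above (markers kept inline).
AUTHORS = (
    "Battram T1,2*, Richmond RC1,2*, Baglietto L3†, "
    "Haycock P1,2†, Perduca V4, Bojesen S5,6,7"
)

# Hypothetical pattern: a name (no digits), comma-separated affiliation
# numbers, an optional footnote symbol, then a separating comma or the end.
ENTRY = re.compile(
    r"\s*(?P<name>\D+?)\s*(?P<nums>\d+(?:,\d+)*)\s*(?P<note>[*†$]?)\s*(?:,|$)"
)


def expected_affiliations(author_line):
    """Map each author name to the list of affiliation numbers that follow it."""
    result = {}
    for match in ENTRY.finditer(author_line):
        name = match.group("name").strip()
        result[name] = [int(n) for n in match.group("nums").split(",")]
    return result


for author, numbers in expected_affiliations(AUTHORS).items():
    print(author, numbers)
```

With this input, the second author (Richmond RC) maps to affiliations 1 and 2 only, which is the association the extracted header should ideally reproduce.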

kermitt2 commented 5 years ago

See #451 for other cases