kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

GROBID does not parse author affiliations anymore. #1164

Closed mbosten closed 3 weeks ago

mbosten commented 2 months ago

I tried parsing PDFs today but GROBID seems to leave the author affiliation out for every document.

I used Docker with the GROBID DL model (0.8.1-name-address) and did not specify a consolidation service. Later, I tried the web service demo with the consolidation service included, but to no avail. I do not believe this is due to specific PDF formatting, since PDFs that were previously converted correctly now lack affiliation information when run through the web service demo (see example below).

Using the web service with header and funder consolidation: `

Peiyao Li **(removed for privacy reasons)**

`

Running the same script from the command line a few months earlier: `

PeiyaoLi **(removed for privacy reasons)** TKLNDST Nankai University
China

`

Any clue where the issue might lie? Affiliation parsing is the most important aspect of the GROBID output for me, so any help would be much appreciated!
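For anyone wanting to check their own output for the same symptom, here is a minimal sketch that lists which header authors in a TEI result have no affiliation. It assumes the TEI XML has already been obtained from GROBID (for example from the `/api/processHeaderDocument` endpoint); the function name `authors_with_affiliations` and the sample document are hypothetical and only illustrate the shape of the output, not an actual GROBID response.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI = "{http://www.tei-c.org/ns/1.0}"


def authors_with_affiliations(tei_xml):
    """Map each header author name to a list of affiliation strings."""
    root = ET.fromstring(tei_xml)
    result = {}
    # Restrict the search to <fileDesc> so bibliography authors are ignored.
    for file_desc in root.iter(TEI + "fileDesc"):
        for author in file_desc.iter(TEI + "author"):
            pers = author.find(TEI + "persName")
            if pers is None:
                continue
            name = " ".join(t.strip() for t in pers.itertext() if t.strip())
            affiliations = [
                " ".join(t.strip() for t in aff.itertext() if t.strip())
                for aff in author.findall(TEI + "affiliation")
            ]
            result[name] = affiliations
    return result


# Hypothetical sample, NOT real GROBID output: one author with an
# affiliation and one without.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename type="first">Jane</forename><surname>Doe</surname></persName>
      <affiliation key="aff0">
        <orgName type="institution">Example University</orgName>
        <address><country key="XX">Nowhere</country></address>
      </affiliation>
    </author>
    <author>
      <persName><forename type="first">John</forename><surname>Roe</surname></persName>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""

parsed = authors_with_affiliations(SAMPLE_TEI)
missing = [name for name, affs in parsed.items() if not affs]
print("Authors without affiliation:", missing)
```

Running this over the output of two different images or versions would make it easy to see whether affiliations disappear for every author or only for some.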

lfoppiano commented 2 months ago

@mbosten the Docker image 0.8.1-name-address is an experimental, work-in-progress image.

I would recommend using the stable version, 0.8.0.

lfoppiano commented 2 months ago

@mbosten I confirm there is a small issue with affiliations when running the GROBID 0.8.1 full image. The problem does not happen on the CRF-only image (lfoppiano/grobid:0.8.1), nor on the master version when using only the CRF models.

lfoppiano commented 2 months ago

@mbosten the problem should now be solved for every Docker image of 0.8.1.