kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Author affiliations not extracted correctly #451

Open danielrbrowne opened 5 years ago

danielrbrowne commented 5 years ago

Attached is an example of a paper (converted to PDF in LibreOffice) where author affiliations are orphaned from their associated authors (i.e. a separate <author> is present with a nested <affiliation>), and some of the affiliations are missing altogether. I also noted that the last author has not been extracted at all. This was using the /processHeaderDocument endpoint.

Manuscript (1).docx Manuscript (1).pdf
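
The orphaned-affiliation pattern described above (an <author> element that carries an <affiliation> but no name) can be checked for mechanically in the returned TEI. This is a minimal sketch, assuming the output uses the standard TEI namespace and that a properly extracted author has a <persName> child. The function name and the sample XML are hypothetical and are not taken from the attached manuscript.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the header TEI uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def find_orphaned_affiliations(tei_xml: str) -> List[str]:
    """Return the text of affiliations nested in <author> elements that have no <persName>."""
    root = ET.fromstring(tei_xml)
    orphans = []
    for author in root.iter("{http://www.tei-c.org/ns/1.0}author"):
        has_name = author.find("tei:persName", TEI_NS) is not None
        affiliations = author.findall("tei:affiliation", TEI_NS)
        if not has_name:
            for aff in affiliations:
                # Collapse whitespace so nested elements read as one line.
                orphans.append(" ".join("".join(aff.itertext()).split()))
    return orphans


# Hypothetical sample: one named author, one orphaned affiliation.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
<author><persName><forename type="first">Jane</forename><surname>Doe</surname></persName></author>
<author><affiliation key="aff0"><orgName type="institution">Example University</orgName></affiliation></author>
</analytic></biblStruct></sourceDesc></fileDesc></teiHeader></TEI>"""

print(find_orphaned_affiliations(SAMPLE_TEI))
```

A check like this only flags the symptom. It does not say which author the affiliation should have been attached to.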

danielrbrowne commented 5 years ago

Here is another example where orphaned affiliations are extracted from a manuscript. I'm guessing that in this specific case the line numbers could be throwing the extraction heuristics off, but I'm not sure: 671727.full.pdf

danielrbrowne commented 5 years ago

Another example. I am attaching this one as an extreme case, and I acknowledge there may not be anything sensible that Grobid can do, considering there's no actual cross-referencing in the manuscript between the authors and their affiliations. I.e. it's a layout a human can understand easily, but ML perhaps less so:

(asce)1532-3641(2001)1&3c1(21).pdf

danielrbrowne commented 5 years ago

Another example of a poorly extracted title, authors and affiliations. The abstract has also failed to be extracted at all. This was using the /processHeaderDocument endpoint. I think the running theme with at least some of these documents is the inclusion of line numbers throwing off Grobid?

latex 1.pdf

kermitt2 commented 5 years ago

Thanks a lot @danielrbrowne for the problematic use cases, it's useful to have them together with a description of the issues.

Indeed, the review format with line numbers is not something GROBID supports well for the moment, and it would require some more layout analysis/features. It's not a problem of heuristics, it's really breaking the machine learning, which is trained on uninterrupted field sequences. These line numbers explain most of the errors, I think (fields not interrupted by these numbers are usually pretty okay).
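
To illustrate what an interrupted field sequence looks like, here is a small sketch of a hypothetical pre-processing step that strips margin line numbers from already extracted text lines. This is not how GROBID works and not a proposed fix. It assumes the numbers appear at the start of each line and increase by one, which real review-format PDFs may not satisfy. The function name and sample lines are invented for illustration.

```python
import re
from typing import List

# A short integer at the start of a line, followed by the actual content.
LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})\s+(.*)$")


def strip_margin_line_numbers(lines: List[str], min_run: int = 3) -> List[str]:
    """Drop leading integers only where they form a run n, n+1, n+2, ... of at least min_run lines."""
    parsed = [LEADING_NUMBER.match(line) for line in lines]
    numbers = [int(m.group(1)) if m else None for m in parsed]
    strip = [False] * len(lines)
    start = 0
    while start < len(lines):
        if numbers[start] is None:
            start += 1
            continue
        end = start
        # Extend the run while the next line's number is exactly one higher.
        while (end + 1 < len(lines)
               and numbers[end + 1] is not None
               and numbers[end + 1] == numbers[end] + 1):
            end += 1
        if end - start + 1 >= min_run:
            for i in range(start, end + 1):
                strip[i] = True
        start = end + 1
    return [parsed[i].group(2) if strip[i] else lines[i] for i in range(len(lines))]


# Hypothetical header lines as they might come out of a line-numbered manuscript.
sample = ["1 Deep learning for", "2 header extraction", "3 Jane Doe", "2019 was a good year"]
cleaned = strip_margin_line_numbers(sample)
print(cleaned)
```

The consecutive-run requirement is there so that a line which merely starts with a number, such as a year, is left alone.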

Regarding the first document, the layout of the header looks simple to us, but it is a bit unusual compared to the existing training data (an affiliation without address or country, like the NIEHS one for example). It's an interesting case which could typically be tackled, I think, by adding a couple of examples like that to the training data.

About the third, I think GROBID is doing great given the "no worry" affiliation list without any cross-referencing - this is really a layout never seen in the training data. Covering that would be a more long term goal I think (affiliation attachment is heuristics-based).

The current header model needs to be reworked entirely. It's the oldest model, and there is quite a lot of new information and improvements that could be used now - in particular the new reading order from PDF, spacing, etc. The open issue on this is from 2016 ... https://github.com/kermitt2/grobid/issues/136

It requires quite a lot of work, in particular updating all the existing training data, so it's hard to plan/execute given that this project remains a side project for the contributors. It's easier to carry out small "low hanging fruit" tasks :)

Thanks again for all these test cases, they are always welcome.