Open danielrbrowne opened 5 years ago
Here is another example where orphaned affiliations are extracted from a manuscript. I'm guessing that in this specific case the line numbers could be throwing the extraction heuristics off, but I'm not sure: 671727.full.pdf
Another example. I am attaching this one as an extreme example, and I acknowledge there may not be anything sensible that Grobid can do, considering there's no actual cross-referencing in the manuscript between the authors and their affiliations. I.e. it's a layout a human can understand easily, but ML maybe less so:
Another example of poorly extracted title, authors and affiliations. Also, the abstract has not been extracted at all. This was using the /processHeaderDocument endpoint. I think the running theme with at least some of these documents seems to be the inclusion of line numbers throwing off Grobid?
latex 1.pdf
Thanks a lot @danielrbrowne for the problematic use cases, it's useful to have them together with a description of the issues.
Indeed, the review format with line numbers is not something GROBID supports well for the moment, and it would require some more layout analysis/features. It's not a problem of heuristics; it really breaks the machine learning, which is trained on uninterrupted field sequences. These line numbers explain most of the errors, I think (fields not interrupted by these numbers are usually pretty okay).
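To illustrate how margin line numbers interrupt a field sequence, here is a minimal sketch in Python. It is not part of GROBID and not what the maintainer proposes (which is layout features inside the models); it only works on text that has already been extracted, and the function name `strip_line_numbers` and the `min_fraction` threshold are hypothetical choices made for this example.

```python
import re

# A leading integer (1 to 4 digits) followed by whitespace and the rest of the line.
_LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})\s+(.*)$")


def strip_line_numbers(lines, min_fraction=0.8):
    """Remove leading review-style line numbers from already-extracted text.

    Assumption for this sketch: the numbers are only stripped when most
    lines start with an integer and those integers increase by exactly 1,
    which is the pattern of margin line numbering in review manuscripts.
    """
    matches = [_LEADING_NUMBER.match(line) for line in lines]
    numbers = [int(m.group(1)) for m in matches if m]

    # Too few numbered lines: probably ordinary text, leave it untouched.
    if not lines or len(numbers) / len(lines) < min_fraction:
        return list(lines)

    # Numbers that are not consecutive are probably real content (years, counts).
    if not all(b == a + 1 for a, b in zip(numbers, numbers[1:])):
        return list(lines)

    return [m.group(2) if m else line for m, line in zip(matches, lines)]


header = [
    "1 A Study of Things",
    "2 Jane Doe1, John Roe2",
    "3 1 Institute of Examples",
]
cleaned = strip_line_numbers(header)
```

In the sample, the author line and the affiliation line become contiguous again once the leading numbers are gone, while the affiliation marker `1` in the third line is kept because only the first integer of each line is removed.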
Regarding the first document, the layout of the header looks simple to us, but it is a bit unusual compared to the existing training data (an affiliation without address or country, like the NIEHS for example). It's an interesting case which could typically be tackled, I think, by adding a couple of examples like that to the training data.
About the third, I think GROBID is doing great given the "no worry" affiliation list without any cross-referencing. This is really a layout never seen in the training data. Covering that would be a longer-term goal, I think (affiliation attachment is heuristics-based).
The current header model needs to be reworked entirely. It's the oldest model, and there is quite a lot of new information and improvements that could be used now, in particular the new reading order from the PDF, spacing, etc. The open issue on this is from 2016 ... https://github.com/kermitt2/grobid/issues/136
It requires quite a lot of work, in particular updating all the existing training data, so it's hard to plan/execute given that this project remains a side project for the contributors. It's easier to carry out small "low hanging fruit" tasks :)
Thanks again for all these test cases, they are always welcome.
Attached is an example of a paper (when converted to PDF in LibreOffice) where author affiliations are orphaned from their associated authors (i.e. a separate <author> is present with a nested <affiliation>), as well as some of the affiliations being missing altogether. I also noted the last author has not been extracted at all. This was using the /processHeaderDocument endpoint.
Manuscript (1).docx Manuscript (1).pdf
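For anyone wanting to spot this pattern across many documents, here is a minimal Python sketch that flags orphaned affiliations in header output. It assumes the output is TEI XML in the standard TEI namespace and that a properly attached author carries a `persName` child; the function name `find_orphaned_affiliations` and the sample document are made up for illustration and are not taken from the attached manuscript.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed here to be the one used by the header output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def find_orphaned_affiliations(tei_xml: str) -> list[str]:
    """Return the text of affiliations nested in an <author> without a name.

    Assumption for this sketch: an <author> with an <affiliation> child but
    no <persName> child is the "orphaned" case described in this issue.
    """
    root = ET.fromstring(tei_xml)
    orphans = []
    for author in root.iterfind(".//tei:author", TEI_NS):
        if author.find("tei:persName", TEI_NS) is not None:
            continue  # the affiliation is attached to a named author
        for affiliation in author.iterfind("tei:affiliation", TEI_NS):
            # Collapse whitespace so nested elements read as one string.
            orphans.append(" ".join("".join(affiliation.itertext()).split()))
    return orphans


# Hypothetical, hand-written sample (not real GROBID output).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>
<sourceDesc><biblStruct><analytic>
<author><persName><forename>Ada</forename><surname>Example</surname></persName>
<affiliation><orgName>First Institute</orgName></affiliation></author>
<author><affiliation><orgName>Second Institute</orgName></affiliation></author>
</analytic></biblStruct></sourceDesc></fileDesc></teiHeader></TEI>"""

orphaned = find_orphaned_affiliations(SAMPLE)
```

Here only the second `<author>` is reported, since it has an affiliation but no name. The check does not detect affiliations or authors that are missing altogether, which needs a comparison against the source document.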