kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Conflict of interests missing from xml output #1142

Open mariadelmarq opened 3 months ago

mariadelmarq commented 3 months ago

Hi,

We are looking into using Grobid for a project that examines conflict of interest, funding, and other transparency statements in published articles. These statements are put in different, seemingly random locations depending on the publisher: sometimes in footnotes, sometimes after the abstract, sometimes in the back matter, etc.

For the published pdf for this particular article (not the author manuscript, which is open access, but the actual published pdf by the APA): https://pubmed.ncbi.nlm.nih.gov/27819460/, Grobid does well to extract the funding information from paragraph 4 of the footnote on page 1, but the conflict of interest, contained in paragraph 5 of the same footnote, is missing from the xml output. I suspect perhaps Grobid does not know where to put it in the xml... Is there any chance this has an easy fix?

lfoppiano commented 3 months ago

Hi @mariadelmarq, thanks for reporting this problem.

Could you please send me the PDFs for this issue and for #1143 at luca AT sciencialab.com?

I'm not able to access them via the pubmed / publisher portal 😅

mariadelmarq commented 3 months ago

Sent, thanks heaps for looking into it!

lfoppiano commented 3 months ago

Thanks for sending the files. I'm sorry I did not have time to check them until now.

For the file discussed in this issue, there are a few points to note:

  1. The header model truncated the funding information, and the missing part (near Lee M. Ritterband, tagged as <other>) is somehow lost. I'm not sure this is a bug, because the funding information is correctly covered. As far as I understand, the conflict of interest statement should not be part of the funding statement in the Grobid approach, or at least not in this version of the funding-acknowledgment extraction. I leave this to @kermitt2 for confirmation.
  2. This issue points out an interesting aspect: there is indeed a need to keep the text that is not classified in the header, which is currently lost, and we might want to collect it somewhere in the XML output.
  3. There is another issue with the segmentation, as the first paragraph is also missing from the output XML. All the training data of Grobid is limited to CC-BY documents, so it's possible that this kind of layout has not received particular attention or training data. Nevertheless, it is possible to create private training data to train Grobid to support this kind of document.

kermitt2 commented 3 months ago

Hello !

Indeed, the Conflict of Interest section is not part of the funding section and is considered a section of its own. However, it is not yet identified explicitly as such by Grobid. This is something to do in the future: extending the segmentation and header models to explicitly recognize COI sections, which I don't think is complicated. I have already received this request; COI statements are more and more common.

About the text lost in the header: what is labeled <other> is normally "noise" that we don't want to add to the output (even under a note element). In this example it unfortunately does not work, but if we extend the model(s) to cover COI, we can expect a good fix.

mariadelmarq commented 3 months ago

Thank you both so much for looking into this. For the other articles I'm looking at, Conflict of Interest statements tend to end up in the back matter tag, either one or two divs down, or sometimes within a note tag. Sometimes they do end up in the body, though, which is ok for me, as long as they're somewhere.
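
For anyone with the same need, here is a minimal sketch of how such statements could be located wherever they end up in the TEI output. It is only an illustration under assumptions: the keyword pattern, the function name `find_coi_statements`, and the sample XML are hypothetical and not part of Grobid; the only thing taken from the TEI format is the namespace and the `front`/`body`/`back`, `div`, `head`, `p`, and `note` element names.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
REGIONS = ("teiHeader", "front", "body", "back")
# Hypothetical keyword pattern; extend it for the wording your publishers use.
COI_PATTERN = re.compile(
    r"conflicts?\s+of\s+interests?|competing\s+interests?", re.IGNORECASE
)


def local_name(elem):
    """Return the tag name without its namespace prefix."""
    return elem.tag.split("}", 1)[-1]


def text_of(elem):
    """Concatenate all text under an element and normalize whitespace."""
    return " ".join(" ".join(elem.itertext()).split())


def find_coi_statements(tei_xml):
    """Return a list of dicts describing likely COI statements in a TEI string."""
    root = ET.fromstring(tei_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    matched = set()
    hits = []
    for elem in root.iter():  # document (pre-)order: ancestors come first
        name = local_name(elem)
        if name == "div":
            # A div counts when its own heading mentions a conflict of interest.
            head = elem.find(TEI_NS + "head")
            probe = text_of(head) if head is not None else ""
        elif name in ("p", "note"):
            probe = text_of(elem)
        else:
            continue
        if not COI_PATTERN.search(probe):
            continue
        # Collect ancestors, nearest first.
        ancestors = []
        node = parents.get(elem)
        while node is not None:
            ancestors.append(node)
            node = parents.get(node)
        # Skip elements nested inside something already reported.
        if any(a in matched for a in ancestors):
            continue
        matched.add(elem)
        region = next(
            (local_name(a) for a in ancestors if local_name(a) in REGIONS), None
        )
        hits.append({"region": region, "element": name, "text": text_of(elem)})
    return hits


# Hypothetical sample, shaped like the cases described above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body><div><p>Methods text.</p></div></body>
    <back>
      <div type="annex">
        <div>
          <head>Conflict of interest</head>
          <p>None declared.</p>
        </div>
      </div>
      <div><p>The authors report no competing interests.</p></div>
    </back>
  </text>
</TEI>"""

if __name__ == "__main__":
    for hit in find_coi_statements(SAMPLE):
        print(hit["region"], hit["element"], hit["text"])
```

Matching on the `div` heading as well as on paragraph text is a deliberate choice here, since a statement such as "None declared." carries no keyword of its own. A statement that the header model labels as <other>, as in this issue, would still be absent from the XML and so cannot be found this way.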