kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Trial registration statement missing from xml output #1143

Open mariadelmarq opened 1 month ago

mariadelmarq commented 1 month ago

Hi again,

Another "error case". For the published PDF of https://pubmed.ncbi.nlm.nih.gov/27917460/ (not the freely available author manuscript), the "Trial Registration" statement located just under the abstract on page 1 is missing from Grobid's XML output. Just checking whether there is an easy fix. Happy to chat more or send more info through if it helps.
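
A minimal sketch of how this kind of omission could be checked for, assuming the page text of the PDF and Grobid's XML output are both available as strings. The function names, the sample identifier, and the choice of registries are hypothetical and only for illustration. The patterns cover just three common identifier formats (ClinicalTrials.gov, ISRCTN, ANZCTR) and would not catch free-text preregistration statements.

```python
import re

# Assumed identifier formats for three registries; not an exhaustive list.
REGISTRY_PATTERNS = {
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
    "ANZCTR": re.compile(r"\bACTRN\d{14}\b"),
}


def find_trial_registrations(text):
    """Return (registry, identifier) pairs found in text, grouped by registry."""
    hits = []
    for registry, pattern in REGISTRY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((registry, match.group(0)))
    return hits


def missing_from_xml(xml_output, pdf_text):
    """Return identifiers that occur in the PDF text but not in the XML output."""
    return [
        (registry, identifier)
        for registry, identifier in find_trial_registrations(pdf_text)
        if identifier not in xml_output
    ]


# Placeholder identifier, not the one from the paper discussed in this issue.
page_text = "Trial Registration: ClinicalTrials.gov NCT00000000"
print(missing_from_xml("<TEI></TEI>", page_text))
```

A plain substring check against the XML is used here so that the sketch does not depend on where in the output the statement might eventually be placed.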

lfoppiano commented 1 month ago

This issue is partially covered in #1142 (point 2): information that has no specific standardized place in the header is currently lost, and we should either keep it or find a place for it.

kermitt2 commented 1 month ago

Hello! We could add the trial information to the data availability section. I think the issue is that the training corpus for the data availability section currently contains no text about trial registration, so it is either ignored or ends up under the funding section. Better recognition of information about clinical trials is also an objective of the French Open Science Monitor.

mariadelmarq commented 1 month ago

That would be amazing! We're interested in general statements about preregistration, which includes clinical trial registration, so it would be great to capture the broader statements if at all possible!