kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.59k stars, 459 forks

Trial registration statement missing from xml output #1143

Open mariadelmarq opened 4 months ago

mariadelmarq commented 4 months ago

Hi again,

Another "error case". For the published PDF of https://pubmed.ncbi.nlm.nih.gov/27917460/ (not the freely available author manuscript), the "Trial Registration" statement located just under the abstract on page 1 is missing from Grobid's XML output. Just checking whether there is an easy fix. Happy to chat more or send more info through if it helps.

lfoppiano commented 4 months ago

This issue is partially covered in #1142 (point 2): information that does not have a specific standardized place in the header is lost, and we should either keep it or find a place for it.

kermitt2 commented 4 months ago

Hello! We could add the trial information to the data availability section. I think the issue is that the training corpus for the data availability section currently contains no text about trial registration, so the statement is either ignored or ends up under the funding section. Better recognition of information about clinical trials is also an objective of the French Open Science Monitor.
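
In the meantime, a quick way to check where (if anywhere) a trial statement ended up in a given TEI result could look like the sketch below. It is only an illustration under assumptions: the function name `find_trial_mentions` and the regular expression are hypothetical, and the `type` values it reports (such as `funding` or `availability`) are simply whatever appears on the `<div>` elements of the output being inspected.

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Assumption: a rough pattern for the phrase "trial registration" and a few
# common registry identifiers. It is illustrative, not exhaustive.
TRIAL_PATTERN = re.compile(
    r"trial registration|\bNCT\d{8}\b|\bISRCTN\d+\b|\bACTRN\d+\b",
    re.IGNORECASE,
)


def _flat_text(elem):
    """Collect all nested text of an element and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


def find_trial_mentions(tei_xml):
    """Return (location, text) pairs for parts of the TEI that match.

    The location is "abstract" or the value of a <div>'s type attribute.
    An empty list means no matching text was found in those places.
    """
    root = ET.fromstring(tei_xml)
    hits = []

    for abstract in root.iter(TEI_NS + "abstract"):
        text = _flat_text(abstract)
        if TRIAL_PATTERN.search(text):
            hits.append(("abstract", text))

    # Only typed <div> elements are reported; untyped (nested) divs are
    # covered through the text of their typed parent.
    for div in root.iter(TEI_NS + "div"):
        div_type = div.get("type")
        if div_type is None:
            continue
        text = _flat_text(div)
        if TRIAL_PATTERN.search(text):
            hits.append((div_type, text))

    return hits
```

Calling `find_trial_mentions(tei_string)` on the XML returned for a PDF gives an empty list when the statement was dropped, or an entry such as `("funding", ...)` when it was attached to another section.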

mariadelmarq commented 4 months ago

That would be amazing! We're interested in general statements about preregistration, which includes clinical trial registration, so it would be great to capture the broader statements if at all possible!