kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

some URLs are not extracted in DAS #1184

Closed lfoppiano closed 3 weeks ago

lfoppiano commented 1 month ago

Found a case where the URL is not extracted from the PDF:

10.1371_journal.pone.0215651.pdf

Without sentence segmentation:

<p>The wind and water level data is open access provided by the Swedish Meteorological and Hydrological Institute, SMHI, through www.smhi.se. The grain size data is available through an FTP server with address www. tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx. Model code, input and results files will not be provided open access but can be made available upon request. Contact Caroline Hallin at Lund University, Sweden, at caroline.hallin@tvrl.lth.se. Topographic and bathymetric data cannot be sha

With sentence segmentation:

FTP server with address www.</s><s xml:id="_4pDBZYt" coords="1,36.00,626.65,138.17,6.80">tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx.</s>

lfoppiano commented 1 month ago

I flagged this as a bug, but the real problem is that these URLs are not matched in the first place: they start with www. instead of http|ftp.
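To illustrate the mismatch, here is a minimal Python sketch. Both patterns are hypothetical stand-ins written for this comment, not the regexes actually used in Grobid; the input text is taken from the paragraph quoted above.

```python
import re

# Hypothetical scheme-only pattern (assumption: stands in for the existing regexes).
SCHEME_ONLY = re.compile(r"(?:https?|ftp)://[^\s<>\"]+", re.IGNORECASE)

# Hypothetical extension that also accepts a bare "www." prefix.
WITH_WWW = re.compile(r"(?:(?:https?|ftp)://|www\.)[^\s<>\"]+", re.IGNORECASE)

text = (
    "SMHI, through www.smhi.se. The grain size data is available through "
    "an FTP server with address www. tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx."
)


def find_urls(pattern, s):
    # Strip trailing sentence punctuation picked up by the greedy match.
    return [m.group(0).rstrip(".,;:") for m in pattern.finditer(s)]


scheme_only_hits = find_urls(SCHEME_ONLY, text)  # nothing starts with a scheme
with_www_hits = find_urls(WITH_WWW, text)        # only the unbroken URL matches

print(scheme_only_hits)
print(with_www_hits)
```

In this sketch the extended pattern picks up www.smhi.se, but the second URL is still missed because the PDF text has a space after "www.", so widening the prefix alone would not cover that case.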

There is a bit of a conundrum here: should we extend these regexes, at the risk of breaking them? The regex matches should be validated against the PDF's URL annotations when they are available.
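A rough Python sketch of what validating against annotations could look like. The helper names (`normalize`, `validate`) and the input values are assumptions made for illustration; they do not reflect Grobid's code or the actual annotations in this PDF.

```python
import re


def normalize(url):
    # Remove whitespace (PDF layout can break a URL), drop the scheme and any
    # trailing slash, and lowercase everything for a loose comparison.
    url = re.sub(r"\s+", "", url.strip())
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url, flags=re.IGNORECASE)
    return url.rstrip("/").lower()


def validate(candidates, annotation_uris):
    """Split regex candidates into those confirmed by an annotation and the rest."""
    known = {normalize(u) for u in annotation_uris}
    confirmed = [c for c in candidates if normalize(c) in known]
    unconfirmed = [c for c in candidates if normalize(c) not in known]
    return confirmed, unconfirmed


# Illustrative inputs only: the annotation URI is assumed, not read from the PDF,
# and "www.example" stands for a spurious regex match.
annotations = ["http://www.smhi.se/"]
candidates = ["www.smhi.se", "www.example"]

confirmed, unconfirmed = validate(candidates, annotations)
print(confirmed, unconfirmed)
```

With a check like this, a looser regex can propose more candidates while the annotations, when present, filter out the spurious ones. Lowercasing the whole URL is a simplification, since paths can be case-sensitive.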

Proposal for fix in https://github.com/kermitt2/grobid/pull/1185