some URLs are not extracted in DAS

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Apache License 2.0


Found a case where a URL in the data availability statement (DAS) is not extracted from the PDF:

10.1371_journal.pone.0215651.pdf

Without sentence segmentation:

<p>The wind and water level data is open access provided by the Swedish Meteorological and Hydrological Institute, SMHI, through www.smhi.se. The grain size data is available through an FTP server with address www. tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx. Model code, input and results files will not be provided open access but can be made available upon request. Contact Caroline Hallin at Lund University, Sweden, at caroline.hallin@tvrl.lth.se. Topographic and bathymetric data cannot be sha

With sentence segmentation:

FTP server with address www.</s><s xml:id="_4pDBZYt" coords="1,36.00,626.65,138.17,6.80">tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx.</s>
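
To make the failure mode concrete, here is a minimal Python sketch of one possible workaround: rejoin a `www.` prefix that the PDF text extraction separated from the rest of the host name before looking for URLs (and before sentence segmentation gets a chance to split at `www.`). This is not GROBID's code; the names `rejoin_split_urls` and `find_urls` and both regular expressions are hypothetical, and the heuristic only covers the `www. host.tld` pattern seen in this PDF.

```python
import re

# Hypothetical heuristic, not part of GROBID: match "www." followed by
# whitespace, but only when what comes next looks like "host.tld".
SPLIT_WWW = re.compile(r"\b(www\.)\s+(?=[A-Za-z0-9-]+\.[A-Za-z]{2,})")

# Simple illustrative URL pattern: "www." up to the next whitespace,
# without trailing sentence punctuation.
URL_PATTERN = re.compile(r"\bwww\.[^\s<>]+[^\s<>.,;:]")


def rejoin_split_urls(text: str) -> str:
    """Remove the spurious whitespace between 'www.' and the host name."""
    return SPLIT_WWW.sub(r"\1", text)


def find_urls(text: str) -> list[str]:
    """Return the 'www.' URLs found after repairing split prefixes."""
    return URL_PATTERN.findall(rejoin_split_urls(text))


if __name__ == "__main__":
    das = (
        "Hydrological Institute, SMHI, through www.smhi.se. The grain size "
        "data is available through an FTP server with address www. "
        "tvrl.se/caf/ftp/Sieved_grain_size_samples.xlsx. Model code, input "
        "and results files will not be provided open access."
    )
    for url in find_urls(das):
        print(url)
```

A sentence that genuinely ends in `www.` and is followed by a word containing a dot would be joined incorrectly, so a real fix would likely need layout information (such as the token coordinates) rather than text alone.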

some URLs are not extracted in DAS #1184