Closed lfoppiano closed 3 weeks ago
I flagged this as a bug, but the underlying problem is that the URLs are not matched in the first place: they start with `www.` instead of `http|ftp`.
This poses a bit of a conundrum: should we try to extend these regexes, at the risk of breaking them? Any change to the regexes should be validated against the URL annotations in the PDF, when they are available.
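To make the trade-off concrete, here is a minimal sketch in Python of what such an extension and validation could look like. It is not GROBID's actual regex or code: `URL_PATTERN`, `find_url_candidates`, `normalize`, and `validate_against_annotations` are hypothetical names, and the annotation targets are assumed to have already been read from the PDF as plain strings.

```python
import re

# Hypothetical pattern (not GROBID's): accepts a scheme prefix or a bare "www." prefix.
URL_PATTERN = re.compile(r"(?:(?:https?|ftp)://|www\.)[^\s<>\"']+", re.IGNORECASE)

# Crude cleanup: punctuation that usually belongs to the sentence, not the URL.
# This will also strip a legitimate closing parenthesis at the end of a URL.
TRAILING_PUNCT = ".,;:)]}"


def find_url_candidates(text):
    """Return URL-like substrings of text, with trailing punctuation trimmed."""
    return [m.group(0).rstrip(TRAILING_PUNCT) for m in URL_PATTERN.finditer(text)]


def normalize(url):
    """Lower-case, drop the scheme and any trailing slash, so that
    'www.x.org' and 'http://www.x.org/' compare equal."""
    url = url.strip().lower()
    url = re.sub(r"^(?:https?|ftp)://", "", url)
    return url.rstrip("/")


def validate_against_annotations(candidates, annotation_urls):
    """Split candidates into those confirmed by an annotation target and the rest."""
    known = {normalize(u) for u in annotation_urls}
    confirmed = [c for c in candidates if normalize(c) in known]
    unconfirmed = [c for c in candidates if normalize(c) not in known]
    return confirmed, unconfirmed


if __name__ == "__main__":
    text = "Data at www.example.org/data, code at http://example.com/code."
    candidates = find_url_candidates(text)
    print(validate_against_annotations(candidates, ["http://www.example.org/data/"]))
```

Candidates that end up in the unconfirmed list are not necessarily wrong, since not every PDF carries annotations for all of its URLs, but a jump in unconfirmed matches after a regex change would be a hint that the extended pattern is over-matching.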
Proposal for fix in https://github.com/kermitt2/grobid/pull/1185
Found a case where the URL is not extracted from the PDF:
10.1371_journal.pone.0215651.pdf
Without sentence segmentation:
With sentence segmentation: