kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Fix URL extraction when the regex falls short #1190

Open lfoppiano opened 4 weeks ago

lfoppiano commented 4 weeks ago

This PR fixes URL extraction when the regular expression match is shorter than the actual target (the annotated URL).
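To illustrate the kind of case described above, here is a minimal Python sketch, not the code in this PR. It assumes that the URL annotation is available as a character span over the text; the pattern `URL_PATTERN` and the helper `extend_match_to_annotation` are hypothetical names chosen for the example. When the regex match overlaps the annotated span but stops early, the two spans are merged so the longer target wins.

```python
import re

# Hypothetical, deliberately simple pattern: it stops at whitespace or ")",
# so it truncates URLs that legitimately end with a closing parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s)]+")


def extend_match_to_annotation(text, annotation_span):
    """Return a (start, end) span covering both the regex match and the
    overlapping annotated span, or None if no regex match overlaps it.

    annotation_span is an assumed (start, end) character range taken from
    the annotated URL.
    """
    ann_start, ann_end = annotation_span
    for match in URL_PATTERN.finditer(text):
        # Two half-open ranges overlap when each starts before the other ends.
        if match.start() < ann_end and ann_start < match.end():
            return (min(match.start(), ann_start), max(match.end(), ann_end))
    return None


text = "See https://example.org/wiki/Foo_(bar) now"
url = "https://example.org/wiki/Foo_(bar)"
start = text.index(url)
span = extend_match_to_annotation(text, (start, start + len(url)))
recovered = text[span[0]:span[1]]
```

In this example the regex alone stops before the closing parenthesis, while `recovered` holds the full annotated URL.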

coveralls commented 4 weeks ago

Coverage Status

coverage: 40.768% (+0.01%) from 40.755% when pulling 35ec905702a2e8e5557f460cb6f71cd5ec06a689 on fix-url-extraction-regex-shorter into be44579606f3953473119edf5e34701aad9f1a55 on master.

lfoppiano commented 1 week ago

Added a fix for the edge case (see image) where genius editors add a hyphen (-) to break up a URL over two lines.

Here is the document: https://doi.org/10.1038/s41588-024-01785-9
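The hyphen edge case can be sketched in Python as follows. This is an illustration under assumptions, not the fix in this PR: it supposes that the two line fragments are already known and that an annotated target URL may be available to decide whether the trailing hyphen belongs to the URL. The function `join_broken_url` and its parameters are hypothetical, and the fallback of keeping the hyphen when no annotation is available is an arbitrary choice for the example.

```python
def join_broken_url(first_line, second_line, annotated_target=None):
    """Join a URL split over two lines.

    If the first fragment ends with "-", the hyphen may be part of the URL
    or may have been inserted at the line break. The annotated target
    (assumed to be available, e.g. from a link annotation) decides.
    """
    head = first_line.rstrip()
    tail = second_line.lstrip()
    if not head.endswith("-"):
        return head + tail

    kept = head + tail          # hyphen is part of the URL
    dropped = head[:-1] + tail  # hyphen was inserted at the line break

    if annotated_target in (kept, dropped):
        return annotated_target
    # No usable annotation: keep the hyphen (arbitrary fallback in this sketch).
    return kept


inserted = join_broken_url("https://example.org/long-", "path",
                           "https://example.org/longpath")
genuine = join_broken_url("https://example.org/long-", "path",
                          "https://example.org/long-path")
```

Here `inserted` drops the hyphen because the annotation says it was added at the line break, while `genuine` keeps it.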