whitespace within a URL string in GROBID converted text

Hi :)

When using GROBID-converted text from PDFs, our downstream collaborators found whitespace inside URL strings. For example, in our use case, where GROBID produced text from PMC articles, one downstream collaborator wrote:

I noticed in the existing validation articles there are sometimes spurious spaces inside URLs. I wanted to draw your attention to this as it may be correctable upstream.

For example in PMC3529402-0 (I highlighted the spaces in red.):

These digital images were processed via Roman v1.7 software (Roman software version V1.70; Robert Jones and Agnes Hunt Orthopaedic Hospital, Oswestry, UK; http://www.cookedbits.co.uk/roma n/).

PMC3762194-7: predicted in the AUGUSTUS web server (http://bioinf.uni-greifswald.de/ webaugustus/prediction/create, last accessed

I suspect that the original PDF publication process inserts spaces or line breaks because URLs are so long and need line breaks. Maybe the line breaks get converted to spaces at some point.

I checked the corresponding PDF files: in both cases the whitespace appears exactly where the URL wraps across a line break in the original PDF. So I am leaving a note here :)
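
In case it helps anyone hitting the same thing downstream, here is a minimal post-processing sketch in Python. It is not how GROBID handles URLs and does not use any GROBID API; `repair_urls` is a hypothetical helper name. It assumes that a fragment following the space belongs to the URL only if it contains a `/`, which covers the two examples above but can misfire on ordinary text such as "and/or" appearing right after a URL.

```python
import re

# Heuristic (an assumption, not GROBID behaviour): a URL, then whitespace,
# then a token containing "/" is treated as a URL broken by a line wrap.
_BROKEN_URL = re.compile(r'(https?://[^\s<>"]+)\s+(?=[^\s<>"]*/)')


def repair_urls(text: str) -> str:
    """Remove whitespace that splits a URL, using the "/" heuristic above."""
    previous = None
    # Repeat until stable so URLs broken in several places are fully rejoined.
    while previous != text:
        previous = text
        text = _BROKEN_URL.sub(r"\1", text)
    return text


if __name__ == "__main__":
    samples = [
        "UK; http://www.cookedbits.co.uk/roma n/).",
        "(http://bioinf.uni-greifswald.de/ webaugustus/prediction/create, last accessed",
    ]
    for sample in samples:
        print(repair_urls(sample))
```

A break that falls before a fragment without a `/` (for example in the middle of the last path segment) would not be repaired by this heuristic, so a fix at extraction time, where the line-break position is known, would be more reliable.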

kermitt2 / grobid #679