kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

whitespace within a URL string in GROBID converted text #679

Open caifand opened 3 years ago

caifand commented 3 years ago

Hi :)

When we use the GROBID-converted text from PDFs, our downstream collaborators have found whitespace inside URL strings. For example, in our use case, where GROBID produced text from PMC articles, our downstream collaborator wrote:

I noticed in the existing validation articles there are sometimes spurious spaces inside URLs. I wanted to draw your attention to this as it may be correctable upstream.

For example in PMC3529402-0 (I highlighted the spaces in red.):

These digital images were processed via Roman v1.7 software (Roman software version V1.70; Robert Jones and Agnes Hunt Orthopaedic Hospital, Oswestry, UK; http://www.cookedbits.co.uk/roma n/).

PMC3762194-7: predicted in the AUGUSTUS web server (http://bioinf.uni-greifswald.de/ webaugustus/prediction/create, last accessed

I suspect that the original PDF publication process inserts spaces or line breaks because URLs are so long and need line breaks. Maybe the line breaks get converted to spaces at some point.

I checked the corresponding PDF file; indeed in these two cases the whitespace arises where there is a line break in the original PDF file. So I am leaving a note here :)

kermitt2 commented 3 years ago

Thanks a lot for the issue Fan!

This is a case that should be doable in our "de-hyphenation" process.
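
Until that is handled upstream, a downstream post-processing step could rejoin the most obvious cases. The following is only a minimal sketch of such a workaround, not GROBID's de-hyphenation logic. The function name `rejoin_split_urls` and the heuristic itself are assumptions made for illustration: a whitespace-separated fragment is glued back onto a preceding `http(s)://` URL only if the fragment contains a `/`. This covers both examples reported above, but it can miss splits where the continuation has no slash, and it can wrongly join ordinary text such as "and/or" that happens to follow a URL.

```python
import re

# Hypothetical heuristic: a URL, followed by one or more whitespace-separated
# fragments that each contain at least one "/" and do not start a new URL.
_SPLIT_URL = re.compile(
    r"https?://\S+"                        # the first part of the URL
    r"(?:\s+(?!https?://)[^\s/]*/\S*)*"    # continuation fragments containing "/"
)


def rejoin_split_urls(text: str) -> str:
    """Remove whitespace (including line breaks) inside URLs matched by the heuristic."""
    return _SPLIT_URL.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


if __name__ == "__main__":
    samples = [
        "Oswestry, UK; http://www.cookedbits.co.uk/roma n/).",
        "(http://bioinf.uni-greifswald.de/ webaugustus/prediction/create, last accessed",
        "see http://example.com/ for details",  # no "/" in the next word: left alone
    ]
    for sample in samples:
        print(rejoin_split_urls(sample))
```

Trailing punctuation such as `).` stays attached to the URL exactly as it was in the input; the sketch only removes the whitespace and does not try to decide where the URL ends.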