A good PDF processing library like pdftoxml tries to recreate valid spacing (with respect to the visual rendering), but of course this is difficult.
It appears that quite a lot of PDFs result in problematic character spacing for some fonts, in particular in the header section. For instance, for this PDF, the author sequence is extracted by pdftoxml as:
M ihael ARCAN 1 Chr ist ian F E DERM AN N 2 Paul BU I T E LAAR 1
It is then very hard for the CRF to predict a good sequence labeling on this...
Note that Mac OS X Preview recomposes it correctly; a direct cut and paste from the PDF gives:
Mihael ARCAN1 Christian FEDERMANN2 Paul BUITELAAR1
If Apple can do it, we can certainly do it right too ;)
Improve pdftoxml to recover spacing from character positions. We can estimate the average spacing ratio in a sequence of characters and use it to decide whether or not a space occurs at a given place (so making minWordBreakSpace dynamic).
Add a post-processing step in Grobid, for instance a specialised Shannon noisy-channel model with a character language model for post-correction, like OCR post-correction in the line of Kolak and Resnik (2002) and what I did in section 3.9 of this paper. However, it might be computationally expensive, so it would be limited to particular sequences.
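To make the first option more concrete, here is a minimal Python sketch of a dynamic word-break threshold. It is not pdftoxml code: the glyph format `(char, x_start, x_end)`, the function names `join_glyphs` and `layout`, and the parameters `ratio` and `min_fraction` are all hypothetical and chosen only for illustration. The idea is to take the median gap between consecutive characters as the "typical" letter spacing of the sequence and to emit a space only when a gap is clearly larger than that.

```python
from statistics import median


def join_glyphs(glyphs, ratio=2.5, min_fraction=0.1):
    """Rebuild a string from positioned glyphs on one line.

    glyphs: list of (char, x_start, x_end) in reading order (hypothetical format).
    ratio: a gap counts as a word break if it exceeds ratio * typical gap (assumed value).
    min_fraction: floor for the threshold, as a fraction of the mean glyph width,
                  used when the typical gap is close to zero (assumed value).
    """
    if not glyphs:
        return ""
    # Horizontal gap between the end of each glyph and the start of the next one.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(glyphs, glyphs[1:])]
    widths = [end - start for _, start, end in glyphs]
    mean_width = sum(widths) / len(widths)
    # The median is robust to the few large gaps that are real word breaks.
    typical_gap = median(gaps) if gaps else 0.0
    threshold = max(ratio * typical_gap, min_fraction * mean_width)

    out = [glyphs[0][0]]
    for (ch, _, _), gap in zip(glyphs[1:], gaps):
        if gap > threshold:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def layout(words, char_width=5.0, letter_gap=2.0, word_gap=6.0):
    """Toy layout helper: place the characters of `words` on a line (test data only)."""
    glyphs, x = [], 0.0
    for i, word in enumerate(words):
        if i:
            # A letter_gap was already added after the previous glyph; widen it.
            x += word_gap - letter_gap
        for ch in word:
            glyphs.append((ch, x, x + char_width))
            x += char_width + letter_gap
    return glyphs


# Letter-spaced text: every gap is non-zero, but only the larger ones are word breaks.
print(join_glyphs(layout(["Mihael", "ARCAN"])))  # Mihael ARCAN
```

With a fixed threshold below the letter gap (2.0 in this toy layout), every character would be split off as in the extraction above; the per-sequence threshold adapts to the font's own spacing. A real implementation would also have to deal with kerning pairs, superscripts such as the affiliation markers, and very short sequences where the median is not reliable.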
In line with "Converting PDF to XML is a bit like converting hamburgers into cows" (Peter Murray-Rust), character spacing in a PDF is not what we see ;)