kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Character spacing issues with pdftoxml #48

Open kermitt2 opened 9 years ago

kermitt2 commented 9 years ago

In line with "Converting PDF to XML is a bit like converting hamburgers into cows" (Peter Murray-Rust), the character spacing stored in a PDF is not what we see ;)

Good PDF processing libraries like pdftoxml try to recreate valid spacing (with respect to the visual rendering), but of course this is difficult.
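To illustrate why this is difficult, here is a minimal sketch of one common kind of heuristic: insert a space only when the horizontal gap between two glyphs exceeds some fraction of the font size. This is not how pdftoxml is implemented; the `Glyph` class, the `rebuild_text` function, the `gap_ratio` parameter and the glyph coordinates are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph box; real PDF libraries expose richer structures."""
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def rebuild_text(glyphs, font_size, gap_ratio=0.25):
    """Join glyphs left to right, adding a space when the gap is wide enough.

    gap_ratio is an assumed tuning knob, not a pdftoxml parameter.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * font_size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Made-up coordinates: loose letter spacing after "M", a real word gap before "A".
glyphs = [
    Glyph("M", 0.0, 8.0),
    Glyph("i", 9.0, 3.0),    # 1.0 pt gap (letter spacing)
    Glyph("h", 12.5, 5.0),   # 0.5 pt gap
    Glyph("A", 22.0, 7.0),   # 4.5 pt gap (word boundary)
    Glyph("R", 29.5, 6.0),   # 0.5 pt gap
]

reasonable = rebuild_text(glyphs, font_size=10.0)                 # threshold 2.5 pt
too_eager = rebuild_text(glyphs, font_size=10.0, gap_ratio=0.06)  # threshold 0.6 pt
print(repr(reasonable), repr(too_eager))
```

With the looser threshold the letter spacing after "M" is mistaken for a word boundary, which is the same kind of split as in the problematic extractions. A single fixed threshold is unlikely to suit every font, which is one reason the problem is hard.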

It appears that quite a lot of PDFs result in problematic character spacing for some fonts, in particular in the header section. For instance, for this PDF, the author sequence is extracted by pdftoxml as:

   M ihael ARCAN 1 Chr ist ian F E DERM AN N 2 Paul BU I T E LAAR 1

It is then very hard for the CRF to predict a good sequence labeling on this...

Note that Mac OS X Preview recomposes it correctly; a direct cut and paste from the PDF gives:

 Mihael ARCAN1 Christian FEDERMANN2 Paul BUITELAAR1

If Apple can do it, we can certainly do it right too ;)
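To make the difference between the two extractions concrete, the following small sketch compares the two strings quoted above. It only uses the Python standard library; the variable and function names are chosen for illustration.

```python
# The two extractions of the same author line, copied from this issue.
pdftoxml_output = "M ihael ARCAN 1 Chr ist ian F E DERM AN N 2 Paul BU I T E LAAR 1"
preview_output = "Mihael ARCAN1 Christian FEDERMANN2 Paul BUITELAAR1"


def strip_whitespace(s):
    """Remove all whitespace so only the character sequence remains."""
    return "".join(s.split())


# Same characters in the same order: only the spacing differs.
same_characters = strip_whitespace(pdftoxml_output) == strip_whitespace(preview_output)

# Whitespace tokenization shows what a sequence labeler would receive.
pdftoxml_tokens = pdftoxml_output.split()
preview_tokens = preview_output.split()

print(same_characters, len(pdftoxml_tokens), len(preview_tokens))
```

The characters are identical once whitespace is ignored, but the pdftoxml output yields 20 whitespace-separated tokens against 6 for the Preview output, so the labeling model sees many fragments that do not look like names.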

kermitt2 commented 9 years ago

Some possibilities: