kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

combining chars eat spaces #120

Closed. tsgit closed this issue 3 years ago

tsgit commented 3 years ago

This applies to the current git HEAD of pdfalto, and also to the Grobid release.

A combining character leads to removal of a preceding space.

In Grobid, this merges the given name into any surname that starts with such a character.

Take a surname starting with Z plus a Combining Dot Above (NFD: Z (U+005A) followed by ◌̇ (U+0307)) and preceded by a space. "Filip Żarnecki" in the PDF is turned into "FilipŻarnecki" in the pdfalto-generated XML, with Ż == U+017B (NFC).

Grobid turns this into

<persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">Aleksander</forename>
      <surname>Filipżarnecki</surname>
</persName>

See PDF file at https://arxiv.org/pdf/2104.00046

Similarly, Haris Čolić is turned into HarisČolić. The C with combining caron (C followed by U+030C, NFD) is turned into U+010C (NFC), and the leading space disappears.

See PDF file at https://arxiv.org/pdf/2104.00329

And Daniel Ávila is turned into DanielÁvila: the A with combining acute accent (A followed by U+0301, NFD) becomes U+00C1 (NFC), and again the leading space disappears.

See PDF file at https://arxiv.org/pdf/2101.08802.
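For reference, the NFD to NFC mappings in the three cases above can be checked with Python's standard `unicodedata` module. This is only a sketch to show that the composition itself is expected and that normalization alone does not drop a space; it says nothing about how pdfalto performs the composition internally. The `cases` table and the `compose` helper are names made up for this illustration.

```python
import unicodedata

# Decomposed (NFD) sequence -> precomposed (NFC) code point, taken from the
# three reported names. The dict and helper names are illustrative only.
cases = {
    "Z\u0307": "\u017B",  # Z + COMBINING DOT ABOVE    -> Ż
    "C\u030C": "\u010C",  # C + COMBINING CARON        -> Č
    "A\u0301": "\u00C1",  # A + COMBINING ACUTE ACCENT -> Á
}


def compose(text: str) -> str:
    """Return the NFC (composed) form of text."""
    return unicodedata.normalize("NFC", text)


for nfd, nfc in cases.items():
    assert compose(nfd) == nfc

# Normalization by itself keeps the space in front of the surname.
name = compose("Filip Z\u0307arnecki")
assert name == "Filip \u017Barnecki"
print(name)
```

So the composed letters are what one would expect; only the missing space is the problem.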

I run pdfalto simply as ./pdfalto -l 1 2101.08802.pdf

The extended xpdfrc file with additional language support is in the same directory, and the languages subdirectory is present, too.

kermitt2 commented 3 years ago

Thanks a lot @tsgit for the issue and the very clear error cases!

This should be fixed by PR #121.

What makes this case complicated is that there is no space character: spaces have to be introduced based on the positions of the characters, and the same goes for word breaks. With character composition, the modifier character can be composed with the character on its left or on its right, so it is a bit tricky to see exactly where to look for the word break and the space, and to keep a correct baseline for the next word (the baseline is not that of the modifier character!).
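To illustrate the general idea (this is not the pdfalto code, which is C++ and considerably more involved), here is a minimal Python sketch of position-based space insertion that ignores combining marks when measuring gaps. Everything in it is an assumption made for the example: the `Glyph` record, the `rebuild_text` function, the `gap_ratio` threshold, and the simplification that a mark always follows its base character and that only horizontal positions matter.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # a single Unicode character
    x: float      # left edge of the glyph on the line
    width: float  # horizontal extent (0 for the mark in this toy example)


def rebuild_text(glyphs, gap_ratio=0.3):
    """Rebuild a line of text, inserting spaces from horizontal gaps.

    Combining marks are attached to the preceding base character and are
    never used as the geometric reference for the next gap measurement.
    """
    pieces = []
    prev_right = None  # right edge of the last *base* glyph seen
    for g in glyphs:
        if unicodedata.combining(g.char) != 0:
            # Modifier: keep the character, leave the geometry untouched.
            pieces.append(g.char)
            continue
        if prev_right is not None and g.x - prev_right > gap_ratio * g.width:
            pieces.append(" ")  # gap wide enough to count as a word break
        pieces.append(g.char)
        prev_right = g.x + g.width
    # Compose base + mark pairs (NFD -> NFC) only after spacing is decided.
    return unicodedata.normalize("NFC", "".join(pieces))


glyphs = [
    Glyph("p", 0.0, 5.0),
    Glyph("Z", 8.0, 6.0),        # 3-unit gap after "p": word break
    Glyph("\u0307", 10.0, 0.0),  # dot above, drawn over the Z
    Glyph("a", 14.0, 5.0),       # adjacent to the Z: same word
]
print(rebuild_text(glyphs))
```

The point of the sketch is the ordering: the gap is measured between base characters only, and composition happens afterwards, so the modifier can neither swallow a space nor create a spurious word break.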

Anyway, the bug should be fixed. It looks good for the 3 PDFs, and I also ran some more tests where the character composition takes place within a word, just to be sure I am not introducing spurious word breaks now.

tsgit commented 3 years ago

Thanks a lot for the swift response, much appreciated. I confirm that the PR fixes the issue. Nice work!