tsgit closed this issue 3 years ago
Thanks a lot @tsgit for the issue and the very clear error cases!
This should be fixed with PR #121.
What makes this case complicated is that the PDF contains no space characters: spaces have to be introduced based on the positions of the characters, and the same goes for word breaks. With character composition, the modifier character can be composed with the character on its left or on its right, so it's a bit tricky to see exactly where to look for the word break and the space, and to keep a correct baseline for the next word (the baseline is not that of the modifier character!).
Anyway, the bug should be fixed. The output looks good for the 3 PDFs, and I also ran some more tests where the character composition takes place within a word, just to be sure I don't introduce spurious word breaks now.
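To illustrate the difficulty described above, here is a minimal sketch of position-based space insertion that treats combining marks separately. It is not pdfalto's actual code: `Glyph`, `rebuild_text` and `gap_ratio` are hypothetical names, the geometry is reduced to one dimension, and the threshold is an arbitrary assumption.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # a single code point as extracted from the PDF
    x: float      # left edge of the glyph
    width: float  # horizontal extent of the glyph


def rebuild_text(glyphs, gap_ratio=0.25):
    """Join glyphs into text, inserting spaces based on horizontal gaps.

    Combining marks never trigger a space and never move the reference
    position used for the next gap check (hypothetical, simplified logic).
    """
    out = []
    pending = []  # marks waiting for a base character on their right
    prev_start = prev_end = None
    for g in glyphs:
        if unicodedata.combining(g.char) != 0:
            if prev_end is not None and prev_start <= g.x < prev_end:
                out.append(g.char)      # overlaps the previous base: compose left
            else:
                pending.append(g.char)  # compose with the next base character
            continue
        # Gap is measured between base characters only.
        if prev_end is not None and g.x - prev_end > gap_ratio * g.width:
            out.append(" ")
        out.append(g.char)
        out.extend(pending)
        pending.clear()
        prev_start, prev_end = g.x, g.x + g.width
    out.extend(pending)  # marks with no following base are kept, not dropped
    return unicodedata.normalize("NFC", "".join(out))
```

The point of the sketch is that the space decision only ever compares base characters, whichever side of the base the modifier arrives on in the glyph stream, which is the same idea as not taking the baseline from the modifier character.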
Thanks a lot for the swift response, much appreciated. I confirm that the PR fixes the issue. Nice work!
This applies to the current git HEAD of pdfalto, and also to the Grobid release.
A combining character leads to the removal of the preceding space.
In Grobid, this merges a surname starting with such a character into the given name.
A surname starting with Z with Combining Dot Above (NFD): Z (U+005A) + ◌̇ (U+0307), and a leading space,
Filip Żarnecki
in the PDF is turned into
FilipŻarnecki
in the pdfalto-generated XML, with Ż == U+017B (NFC).
Grobid turns this into
See PDF file at https://arxiv.org/pdf/2104.00046
Similarly,
Haris Čolić
is turned into
HarisČolić
The C with combining caron U+030C (NFD) is turned into U+010C (NFC), and the leading space disappears.
See PDF file at https://arxiv.org/pdf/2104.00329
and
Daniel Ávila
is turned into
DanielÁvila
The A with combining acute accent U+0301 is turned into U+00C1, and the leading space is lost.
See PDF file at https://arxiv.org/pdf/2101.08802
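The three NFD-to-NFC compositions reported above can be checked with Python's standard `unicodedata` module. The sketch below only shows that NFC composition by itself keeps a preceding space, so the composition is expected and the missing space is a separate problem. The helper names `compose` and `codepoints` are made up for this example.

```python
import unicodedata


def compose(text):
    """Return the NFC (composed) form of text."""
    return unicodedata.normalize("NFC", text)


def codepoints(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Base letter followed by the combining mark from each report (NFD form).
cases = ["Z\u0307", "C\u030c", "A\u0301"]

for nfd in cases:
    print(codepoints(nfd), "->", codepoints(compose(nfd)))

# Normalization alone does not touch the space before the composed letter.
name = compose("Filip Z\u0307arnecki")
print(" " in name)
```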
I run pdfalto simply as
./pdfalto -l 1 2101.08802.pdf
The extended xpdfrc file with additional language support is in the same directory, and the languages subdirectory is present, too.
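To check the generated XML for the fused names without reading it by hand, something along these lines may help. It is a rough sketch that assumes the output follows the ALTO schema, where each word is a `String` element with a `CONTENT` attribute. The function names and the expected name pairs are hypothetical, and the XML is passed in as a string so that no output file name has to be assumed.

```python
import xml.etree.ElementTree as ET


def alto_tokens(xml_text):
    """Collect the CONTENT of every String element, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        # Tags look like "{namespace}String"; compare the local name only.
        if el.tag.rsplit("}", 1)[-1] == "String":
            tokens.append(el.get("CONTENT", ""))
    return tokens


def fused_names(tokens, expected_pairs):
    """Return the (given, surname) pairs that appear glued into one token."""
    return [(a, b) for a, b in expected_pairs if a + b in tokens]
```

For the reports above, `expected_pairs` would be the given name and surname in NFC form, for example `("Filip", "\u017barnecki")`. An empty result means none of the listed names were merged.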