Improve parsing of names that include diacritics

tukusejssirs commented 5 years ago

As we discussed in #95 (from which I copied portions into this issue), we should improve the parsing of names that include diacritics (such as ľščťžýáíéúäôňďěŕĺöüűő).

As discussed there, Lingua::EN::NameParse (which you use for parsing names) does not currently support names with diacritics. However, its POD documentation includes the following notes:

FUTURE DIRECTIONS

Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-English languages.

BUGS

Names with accented characters (acute, circumflex etc) will not be parsed correctly. A workaround is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.
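To illustrate the trade-off described in the note above, here is a minimal sketch of how an ASCII-only character class and `\w` differ. It uses Python's `re` module purely for illustration and is not the module's actual grammar; the helper names `is_ascii_word` and `is_unicode_word` are hypothetical.

```python
import re

# ASCII-only letters, comparable in spirit to a [a-z]-style character class.
ASCII_WORD = re.compile(r"[A-Za-z]+")
# \w in Python 3 string patterns also matches non-ASCII letters,
# as well as digits and the underscore.
UNICODE_WORD = re.compile(r"\w+")


def is_ascii_word(token: str) -> bool:
    """True if the whole token consists of ASCII letters only."""
    return ASCII_WORD.fullmatch(token) is not None


def is_unicode_word(token: str) -> bool:
    """True if the whole token consists of \\w characters only."""
    return UNICODE_WORD.fullmatch(token) is not None


name = "M\u00e1ria"  # "Mária" with a precomposed a-acute
print(is_ascii_word(name))       # the accented letter is rejected
print(is_unicode_word(name))     # the accented letter is accepted
print(is_unicode_word("R2_D2"))  # but digits and underscores are accepted too
```

The last line shows why the broader class could lower accuracy: tokens that are clearly not names are accepted as well.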

So I think that workaround would be good enough for now, but it would be nice (if possible) to restore the original spelling of the names after parsing, that is:

remove the diacritics (Mária → Maria),
parse the names as usual,
replace the parsed names with their original form (Maria → Mária).
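
The three steps above could be sketched roughly as follows. This is Python rather than Perl, purely to illustrate the idea, and it is not how gedcom implements it. `parse_ascii_name` is a hypothetical stand-in for the real parser (Lingua::EN::NameParse), and the word-by-word restore assumes that stripping never changes the number of words in the name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Step 1: decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_ascii_name(name: str) -> dict:
    """Step 2 (hypothetical stand-in for the real parser):
    the first word is the given name, the rest is the surname."""
    given, _, surname = name.partition(" ")
    return {"given_name": given, "surname": surname}


def parse_name_preserving_diacritics(original: str) -> dict:
    """Run steps 1-3: strip, parse, then restore the original spelling."""
    original = unicodedata.normalize("NFC", original)
    stripped = strip_diacritics(original)
    parsed = parse_ascii_name(stripped)
    # Step 3: map each stripped word back to the word it came from.
    # Limitation: two different originals that strip to the same word
    # (e.g. "Mária" and "Maria" in one name) would collide here.
    restore = dict(zip(stripped.split(), original.split()))
    return {
        field: " ".join(restore.get(word, word) for word in value.split())
        for field, value in parsed.items()
    }


print(parse_name_preserving_diacritics("Mária Nováková"))
```

Note that this kind of stripping only handles letters that decompose into a base letter plus a combining mark; letters such as ł or ø have no such decomposition and would pass through unchanged.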

However, it would be much better to implement Lingua::SK::NameParse, as suggested under Future Directions. I’d like to contact Kim Ryan (the developer of Lingua::EN::NameParse) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer, so I could mainly assist with the linguistic/algorithmic part. Are you willing to help with coding this parser, or are you busy enough with other stuff? :)

nigelhorne commented 5 years ago

Commit https://github.com/nigelhorne/gedcom/commit/0a8d02659ed11edf90ab391eded9ed0479316296 uses Unicode::Diacritic::Strip, though it doesn't yet work, at least not with a test case that I have.

nigelhorne commented 5 years ago

I don’t know why u:d:s (Unicode::Diacritic::Strip) doesn’t work with gedcom. All of my test code outside of it works fine. Still investigating.

nigelhorne commented 1 year ago

It looks like u:d:s doesn't work with UTF-8 fields from gedcoms (perhaps only those from ACOM). I've recently made some improvements in ged2site which should carry over here.

nigelhorne commented 5 months ago

The code still doesn't handle all diacritics, but it should be better than it was, for both UTF-8 and Unicode.

nigelhorne / gedcom

Improve parsing of names that include diacritics #100