Open tukusejssirs opened 5 years ago
Commit https://github.com/nigelhorne/gedcom/commit/0a8d02659ed11edf90ab391eded9ed0479316296 uses Unicode::Diacritic::Strip, though it doesn't yet work, at least not with a test case that I have.
I don’t know why Unicode::Diacritic::Strip (u:d:s) doesn’t work with gedcom. All of my test code outside of it works fine. Still investigating.
It looks like u:d:s doesn't work with UTF-8 fields from gedcoms (perhaps only those from ACOM). I've recently put in some improvements in ged2site which should carry over here.
The code still doesn't handle all diacritics, but it should be better than it was, for both UTF-8 and Unicode.
As we discussed in #95 (from which I simply copied portions to this issue), we should improve the parsing of names that include diacritics (like `ľščťžýáíéúäôňďěŕĺöüűő`).

As we discussed there, `Lingua::EN::NameParse` (which you use for parsing names) currently does not support parsing names with diacritics. However, `Lingua::EN::NameParse` has the following notes in its `perlpod` docs:

So, I think for now it would be good enough to use that workaround, but it would be nice (if it is possible) to restore the original spelling of the names after parsing, that is:
- strip the diacritics before parsing (`Mária` → `Maria`),
- restore them after parsing (`Maria` → `Mária`).

However, it would be much better to implement `Lingua::SK::NameParse`, as it is written in the _Future directions_ section. I’d like to contact Kim Ryan (the developer of `Lingua::EN::NameParse`) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer; I could mainly assist with the linguistic/algorithm part. Are you willing to help with the coding of this parser? Or are you busy enough with other stuff? :)
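
To make the strip-then-restore workaround above more concrete, here is a minimal sketch using only the core `Unicode::Normalize` module. It is not how gedcom or `Lingua::EN::NameParse` currently does it: `strip_diacritics`, `parse_with_diacritics` and especially `parse_name` are hypothetical names, and `parse_name` is just a whitespace-splitting stand-in for the real parser. It also assumes the parser returns the words unchanged (no case changes or abbreviations).

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);

# Decompose each character, then drop every nonspacing combining mark.
sub strip_diacritics {
    my ($text) = @_;
    (my $plain = NFD($text)) =~ s/\p{Mn}//g;
    return $plain;
}

# Hypothetical stand-in for the real name parser (assumption):
# the first word is the given name, the rest is the surname.
sub parse_name {
    my ($plain_name) = @_;
    my ($given, @rest) = split ' ', $plain_name;
    return (given_name => $given, surname => join(' ', @rest));
}

sub parse_with_diacritics {
    my ($name) = @_;

    # Remember which original word each stripped word came from.
    my %original_for;
    for my $word (split ' ', $name) {
        $original_for{ strip_diacritics($word) } = $word;
    }

    # Parse the stripped name, then put the original spelling back word by word.
    my %parts = parse_name(strip_diacritics($name));
    for my $key (keys %parts) {
        $parts{$key} = join ' ',
            map { $original_for{$_} // $_ } split ' ', $parts{$key};
    }
    return %parts;
}

binmode STDOUT, ':encoding(UTF-8)';
my %parts = parse_with_diacritics('Mária Nováková');
print "$parts{given_name} | $parts{surname}\n";   # prints "Mária | Nováková"
```

One known weakness of this sketch: if a name contains two different words that strip to the same text (for example `Maria` and `Mária`), the lookup table keeps only the last one, so a position-based mapping might be safer in a real implementation.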