kennygrant / sanitize

Package sanitize provides functions for sanitizing text in golang strings.
BSD 3-Clause "New" or "Revised" License

corrected german transliterations #22

Closed retailify closed 5 years ago

kennygrant commented 6 years ago

Happy to merge this if you think it would be true for all (or even most) languages using these accents - would they all transform the accents in a similar way?

retailify commented 6 years ago

The transliterations in the pull request conform to the German language. I found this discussion and this Duolingo discussion on the net. For my use case, I need the umlauts transformed this way.

We should have language-dependent transliteration tables. We could use your table as the standard table and merge language-specific tables (e.g. my table) on top of it once a locale has been detected.

What do you think?
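
A minimal sketch of the table-merging idea, in Go. Everything here is hypothetical and for illustration only: the names `defaultTable`, `localeTables` and `TableFor`, the locale keys, and the table contents are assumptions, not taken from the sanitize package. Locale detection itself is left out; the caller passes the locale in.

```go
package main

import "fmt"

// defaultTable stands in for the package's standard table.
// The entries are illustrative assumptions, not the library's real data.
var defaultTable = map[rune]string{'ä': "a", 'ö': "o", 'ü': "u", 'é': "e"}

// localeTables holds per-language overrides, keyed by a hypothetical locale code.
var localeTables = map[string]map[rune]string{
	"de": {'ä': "ae", 'ö': "oe", 'ü': "ue", 'ß': "ss"},
}

// TableFor returns a copy of the default table with the entries for the
// given locale layered on top. Unknown locales get the default table only.
func TableFor(locale string) map[rune]string {
	merged := make(map[rune]string, len(defaultTable))
	for r, s := range defaultTable {
		merged[r] = s
	}
	// Ranging over a missing (nil) map is a no-op, so no lookup check is needed.
	for r, s := range localeTables[locale] {
		merged[r] = s
	}
	return merged
}

func main() {
	de := TableFor("de")
	fmt.Println(de['ü'], de['é'])    // prints "ue e"
	fmt.Println(TableFor("fr")['ü']) // prints "u"
}
```

Copying into a fresh map keeps the default table untouched, so a locale override cannot leak into callers that want the standard behaviour.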

kennygrant commented 6 years ago

Yes, probably a good idea to have another version of the function which takes a map[rune]string, something like:

func AccentsForRunes(s string, transliterations map[rune]string) string

I'd be happy to merge this if you do that, as it wouldn't change the original Accents function.
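
One way such a function could look, as a rough sketch rather than the implementation that was eventually merged. The signature is the one proposed above; the body and the German table in `main` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// AccentsForRunes replaces every rune that has an entry in transliterations
// with its mapped string and copies all other runes through unchanged.
// Sketch only: not the library's actual code.
func AccentsForRunes(s string, transliterations map[rune]string) string {
	var b strings.Builder
	b.Grow(len(s))
	for _, r := range s {
		if repl, ok := transliterations[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Hypothetical caller-supplied table with German-style transliterations.
	german := map[rune]string{'ä': "ae", 'ö': "oe", 'ü': "ue", 'ß': "ss"}
	fmt.Println(AccentsForRunes("Müller, Straße", german)) // prints "Mueller, Strasse"
}
```

Because the table is a parameter, the existing Accents function could keep its current defaults while callers with language-specific needs pass their own map.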

We could then consider if there are some characters where your transliteration makes sense for most languages and change some defaults, but I'd do that separately and do some research on it.

retailify commented 6 years ago

Ok, I'll implement the method.

streambinder commented 5 years ago

Sorry to bring this up again, but I think @retailify's corrected German transliterations patch doesn't fit the scope of the Accents() function. If I'm not wrong, the purpose of that function is to find a standard way of representing the letters that make up a word (or a generic string). The reason I think the patch isn't relevant is that it doesn't just remove accents; it represents the way the word (or the string) should be written to convey its sound (is that right?). Removing accents means removing the mark from the letter, i.e. à becomes a, and so on. As originally proposed by @kennygrant, a better approach would be to separate the two scopes into two different functions.

kennygrant commented 5 years ago

I've already made a decision on this and merged it, so I don't want to reopen it unless it is causing significant problems for someone. The intent is to translate accented characters without losing too much meaning; I agree this could go both ways and is a judgement call.

We did have a translation of ø to oe already, and several others which go from one character to two (e.g. þ to th or æ to ae), so there is precedent there and it doesn't seem very harmful to sometimes use two characters when translating, where that makes sense.