tchwork / utf8

Portable and performant UTF-8, Unicode and Grapheme Clusters for PHP
Apache License 2.0

Fix for Turkish "small dotless i" problem in Utf8::toAscii #2

Closed navruzm closed 11 years ago

navruzm commented 11 years ago

Currently, when a string contains "ı" (small dotless i), Utf8::toAscii does not convert this character properly: "ı" is converted to "?" instead of "i".

This is a known Unicode CLDR bug that was opened 3 years ago and does not seem to be fixed. They said "it should be fixed next release, in bug #3335", which was opened 2 years ago. I don't know when they will fix that, but the bug causes problems like this one: laravel/framework#552

nicolas-grekas commented 11 years ago

Hi, thanks for this pull request. I would replace your line of code with this one:

if (false !== strpos($s, 'ı')) $s = str_replace('ı', 'i', $s);

The small dotless i is listed in http://unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml, but this is not the data source that iconv uses.

We should compare the mapping in this XML file with the one done by iconv and see whether other characters differ...
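A minimal sketch of such a comparison, assuming the Latin-ASCII.xml rules have already been extracted into a PHP array of `character => ASCII` pairs. The helper name `diffMappings`, the `$sample` excerpt and the `$viaIconv` closure are hypothetical and not part of the library. The transliterator is passed in as a callable because iconv's `//TRANSLIT` output depends on the platform and the locale.

```php
<?php
/**
 * Return the entries of $expected that $translit does not reproduce.
 *
 * @param array    $expected map of UTF-8 character => expected ASCII string
 * @param callable $translit function (string $char): string|false
 */
function diffMappings(array $expected, callable $translit): array
{
    $diff = array();

    foreach ($expected as $char => $ascii) {
        // Cast the key: PHP turns numeric string keys into integers.
        $actual = $translit((string) $char);

        if ($actual !== $ascii) {
            $diff[$char] = array('expected' => $ascii, 'actual' => $actual);
        }
    }

    return $diff;
}

// Hypothetical excerpt of the mapping; a real run would load the full table.
$sample = array('ı' => 'i', 'Ø' => 'O', 'þ' => 'th');

// iconv returns false (and emits a notice) when it cannot convert at all.
$viaIconv = function (string $char) {
    return @iconv('UTF-8', 'ASCII//TRANSLIT', $char);
};

if (function_exists('iconv')) {
    // The result is platform-dependent, so no expected output is shown here.
    print_r(diffMappings($sample, $viaIconv));
}
```

Characters that appear in the returned array are the ones where iconv and the mapping table disagree on the current system.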

nicolas-grekas commented 11 years ago

Here are all the characters that are mapped to something by Latin-ASCII.xml but are not mapped by iconv on Ubuntu (the small dotless i is in the list).

Not sure what to do with this list now...

Ð ? D
Ø ? O
Þ ? TH
ð ? d
ø ? o
þ ? th
Đ ? D
đ ? d
Ħ ? H
ħ ? h
ı ? i
ĸ ? q
Ŋ ? N
ŋ ? n
Ŧ ? T
ŧ ? t
ƀ ? b
Ɓ ? B
Ƃ ? B
ƃ ? b
Ƈ ? C
ƈ ? c
Ɖ ? D
Ɗ ? D
Ƌ ? D
ƌ ? d
Ɛ ? E
Ƒ ? F
ƒ ? f
Ɠ ? G
ƕ ? hv
Ɩ ? I
Ɨ ? I
Ƙ ? K
ƙ ? k
ƚ ? l
Ɲ ? N
ƞ ? n
Ƣ ? OI
ƣ ? oi
Ƥ ? P
ƥ ? p
ƫ ? t
Ƭ ? T
ƭ ? t
Ʈ ? T
Ʋ ? V
Ƴ ? Y
ƴ ? y
Ƶ ? Z
ƶ ? z
Ǥ ? G
ǥ ? g
ȡ ? d
Ȥ ? Z
ȥ ? z
ȴ ? l
ȵ ? n
ȶ ? t
ȷ ? j
ȸ ? db
ȹ ? qp
Ⱥ ? A
Ȼ ? C
ȼ ? c
Ƚ ? L
Ⱦ ? T
ȿ ? s
ɀ ? z
Ƀ ? B
Ʉ ? U
Ɇ ? E
ɇ ? e
Ɉ ? J
ɉ ? j
Ɍ ? R
ɍ ? r
Ɏ ? Y
ɏ ? y
ɓ ? b
ɕ ? c
ɖ ? d
ɗ ? d
ɛ ? e
ɟ ? j
ɠ ? g
ɡ ? g
ɢ ? G
ɦ ? h
ɧ ? h
ɨ ? i
ɪ ? I
ɫ ? l
ɬ ? l
ɭ ? l
ɱ ? m
ɲ ? n
ɳ ? n
ɴ ? N
ɶ ? OE
ɼ ? r
ɽ ? r
ɾ ? r
ʀ ? R
ʂ ? s
ʈ ? t
ʉ ? u
ʋ ? v
ʏ ? Y
ʐ ? z
ʑ ? z
ʙ ? B
ʛ ? G
ʜ ? H
ʝ ? j
ʟ ? L
ʠ ? q
ʣ ? dz
ʥ ? dz
ʦ ? ts
ʪ ? ls
ʫ ? lz
ᴀ ? A
ᴁ ? AE
ᴃ ? B
ᴄ ? C
ᴅ ? D
ᴆ ? D
ᴇ ? E
ᴊ ? J
ᴋ ? K
ᴌ ? L
ᴍ ? M
ᴏ ? O
ᴘ ? P
ᴛ ? T
ᴜ ? U
ᴠ ? V
ᴡ ? W
ᴢ ? Z
ᵫ ? ue
ᵬ ? b
ᵭ ? d
ᵮ ? f
ᵯ ? m
ᵰ ? n
ᵱ ? p
ᵲ ? r
ᵳ ? r
ᵴ ? s
ᵵ ? t
ᵶ ? z
ᵺ ? th
ᵻ ? I
ᵽ ? p
ᵾ ? U
ᶀ ? b
ᶁ ? d
ᶂ ? f
ᶃ ? g
ᶄ ? k
ᶅ ? l
ᶆ ? m
ᶇ ? n
ᶈ ? p
ᶉ ? r
ᶊ ? s
ᶌ ? v
ᶍ ? x
ᶎ ? z
ᶏ ? a
ᶑ ? d
ᶒ ? e
ᶓ ? e
ᶖ ? i
ᶙ ? u
ẜ ? s
ẝ ? s
ẞ ? SS
Ỻ ? LL
ỻ ? ll
Ỽ ? V
ỽ ? v
Ỿ ? Y
ỿ ? y
₠ ? CE
₢ ? Cr
₣ ? Fr.
₤ ? L.
₧ ? Pts
₹ ? Rs
℞ ? Rx
〇 ? 0
′ ? '
〝 ? "
〞 ? "
‖ ? ||
⁅ ? [
⁆ ? ]
⁎ ? *
、 ? ,
。 ? .
〈 ? <
〉 ? >
《 ? <<
》 ? >>
〔 ? [
〕 ? ]
〘 ? [
〙 ? ]
〚 ? [
〛 ? ]
︑ ? ,
︒ ? .
︹ ? [
︺ ? ]
︽ ? <<
︾ ? >>
︿ ? <
﹀ ? >
÷ ? /
∥ ? ||
⦅ ? ((
⦆ ? ))
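One possible use of the list above, sketched under assumptions: substitute the listed characters with `strtr` before the string reaches iconv, so that iconv never sees them. The function name `preTransliterate` is hypothetical, the map holds only the first few entries of the list, and this is not necessarily what the fix referenced in this thread does.

```php
<?php
/**
 * Replace characters that iconv leaves unmapped with their ASCII
 * equivalents from the list above (excerpt only).
 */
function preTransliterate(string $s): string
{
    static $map = array(
        'Ð' => 'D', 'Ø' => 'O', 'Þ' => 'TH',
        'ð' => 'd', 'ø' => 'o', 'þ' => 'th',
        'Đ' => 'D', 'đ' => 'd', 'Ħ' => 'H',
        'ħ' => 'h', 'ı' => 'i',
    );

    // strtr with an array replaces all keys in a single pass and never
    // re-scans text it has already replaced.
    return strtr($s, $map);
}

echo preTransliterate('ısı'), "\n"; // prints "isi"
```

A single `strtr` call avoids one `strpos`/`str_replace` pair per character, which matters once the table grows to the size of the full list. Characters that are not in the map pass through unchanged and are left for iconv to handle.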

nicolas-grekas commented 11 years ago

Fixed in https://github.com/nicolas-grekas/Patchwork-UTF8/commit/cc3f2eadf0c0dc6450f978c606f503575907deb4

navruzm commented 11 years ago

Thanks