hym1224 / note

Character set & character encoding. #10

Open hym1224 opened 6 years ago

hym1224 commented 6 years ago

from http://blog.jobbole.com/84903/

hym1224 commented 6 years ago

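1: A character set is just the name of a collection of rules. In everyday terms, a character set is like the name of a language, for example English, Chinese, or Japanese.
2: To encode and decode a character correctly, a character set needs three key elements: a character repertoire, a coded character set, and a character encoding form. The character repertoire works like a database of every readable or displayable character, and it determines the full range of characters the character set can represent. The coded character set uses a code value (a code point) to identify each character's position in the repertoire. The character encoding defines the conversion between code points in the coded character set and the values actually stored.
3: Unicode is the coded character set described above, and UTF-8 is a character encoding.
4: Resolve (fixing garbled text):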


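-- Step 1: re-encode the garbled text as GBK to get back the original bytes (expected: E5BE88E5B18C).
select hex(convert('寰堝睂' using gbk))
union
-- Step 2: decode those bytes as UTF-8 to recover the intended text.
select convert(0xE5BE88E5B18C using utf8);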
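
The same round trip can be sketched in Python, which also makes the difference between a code point and its encoded bytes visible. This is only an illustration: the byte string is the one from the SQL above, and the variable names are made up for the example.

```python
# Bytes taken from the SQL above (0xE5BE88E5B18C).
stored = bytes.fromhex("E5BE88E5B18C")

# Character encoding: decoding the stored bytes as UTF-8 gives the intended text.
intended = stored.decode("utf-8")

# Coded character set: each character has a Unicode code point,
# independent of how it is stored.
code_points = [f"U+{ord(ch):04X}" for ch in intended]

# Garbling: the same bytes decoded with the wrong encoding (GBK).
garbled = stored.decode("gbk")

# Recovery, mirroring the two SQL steps:
# re-encode as GBK to get the bytes back, then decode them as UTF-8.
recovered = garbled.encode("gbk").decode("utf-8")

print(code_points)          # code points of the intended characters
print(garbled)              # what the misread text looks like
print(recovered == intended)
```

The recovery only works because GBK happens to map every byte pair here to some character; if the wrong decoding had dropped or replaced bytes, the original text could not be restored this way.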
hym1224 commented 6 years ago

How to filter (or replace) unicode characters that would take more than 3 bytes? From https://stackoverflow.com/questions/3220031/how-to-filter-or-replace-unicode-characters-that-would-take-more-than-3-bytes (resolved, Python)
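
A minimal Python 3 sketch of the idea, not necessarily the linked answer's exact code: in UTF-8, only code points above U+FFFF need four bytes, so matching that range is enough. The function name `replace_4byte_chars` and the default replacement U+FFFD are assumptions made for this example. It is presumably useful with MySQL's `utf8` charset, which stores at most three bytes per character.

```python
import re

# Code points U+10000..U+10FFFF are the ones that take four bytes in UTF-8.
_FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")


def replace_4byte_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace every character that needs more than 3 bytes in UTF-8.

    Pass replacement="" to filter such characters out instead.
    """
    return _FOUR_BYTE.sub(replacement, text)


sample = "ok \U0001F600 done"
cleaned = replace_4byte_chars(sample)

# Every remaining character now fits in at most 3 UTF-8 bytes.
print(max(len(ch.encode("utf-8")) for ch in cleaned))
```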