Why are Chinese characters also encoded into 255 ids?

BtPig19841 commented 5 years ago

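In the _convert_word_to_char_ids function of the UnicodeCharsVocabulary class, your comment says Chinese can also be mapped to 255 ids. But there are clearly far more than 255 Chinese characters, so isn't it inappropriate to use UTF-8 encoding to produce the ids?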

guotong1988 commented 5 years ago

The ids come from the Unicode encoding of the Chinese characters, not from the characters themselves.

BtPig19841 commented 5 years ago

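a = '中华人民共和国'
word_encoded = a.encode('utf-8', 'ignore')[:]
print(word_encoded)

b'\xe4\xb8\xad\xe5\x8d\x8e\xe4\xba\xba\xe6\xb0\x91\xe5\x85\xb1\xe5\x92\x8c\xe5\x9b\xbd'

The source code targets English: it converts each character of a word to its UTF-8 encoding and stores the values in an array. For example, in "english": e:101, n:110, g:103, l:108, i:105, s:115, h:104, and these values are used as ids into the lookup table. But as shown above, in the encoded output for Chinese every three bytes make up one Chinese character, so how can they be used as ids?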
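To make the byte-level view concrete, here is a minimal sketch (not the repository's code; the helper name `utf8_byte_ids` is made up for illustration). Every UTF-8 byte is an integer in the range 0 to 255, so a table with 256 entries covers any text, and a Chinese character simply occupies three consecutive ids.

```python
def utf8_byte_ids(word):
    """Return the UTF-8 bytes of `word` as a list of integers (hypothetical helper)."""
    return list(word.encode('utf-8', 'ignore'))


english_ids = utf8_byte_ids('english')
chinese_ids = utf8_byte_ids('中')

print(english_ids)  # one id per ASCII letter
print(chinese_ids)  # three ids for a single Chinese character
```

Under this scheme the model never sees "中" as one unit; it sees three byte ids in sequence.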

BtPig19841 commented 5 years ago

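A few more questions: 1. In your code max_word_length defaults to 20, but Chinese encodes at a ratio of 1:3. For example, the seven characters of '中华人民共和国' encoded as UTF-8 should give 3*7=21 bytes. With max_word_length at 20 and two slots reserved for the begin and end markers, i.e. self.max_word_length-2 in the source, only 18 bytes remain, as shown below: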

import numpy as np

a = '中华人民共和国'
max_word_length = 20
code = np.zeros([max_word_length], dtype=np.int32)
code[:] = 0
word_encoded = a.encode('utf-8', 'ignore')[:max_word_length-2]
print('Encoded:', word_encoded, '\nCount:', len(word_encoded))
print('Decoded again:', word_encoded.decode())

Encoded: b'\xe4\xb8\xad\xe5\x8d\x8e\xe4\xba\xba\xe6\xb0\x91\xe5\x85\xb1\xe5\x92\x8c'
Count: 18
Decoded again: 中华人民共和

This truncates the original information unnecessarily.
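In the example above the cut at 18 bytes happens to land on a character boundary, because 18 is a multiple of 3. With a different limit, or with mixed ASCII and Chinese text, a plain byte slice can split a character in the middle. A minimal sketch of a boundary-safe truncation follows; `truncate_utf8` is a hypothetical helper, not something in the repository.

```python
def truncate_utf8(word, max_bytes):
    """Truncate `word` to at most `max_bytes` UTF-8 bytes without splitting a character.

    Hypothetical helper for illustration only.
    """
    encoded = word.encode('utf-8')[:max_bytes]
    # errors='ignore' drops a trailing incomplete multi-byte sequence, if any
    return encoded.decode('utf-8', 'ignore')


word = '中华人民共和国'
cut_at_20 = truncate_utf8(word, 20)  # 20 bytes would split the 7th character
cut_at_21 = truncate_utf8(word, 21)  # 21 bytes fits the whole word
print(cut_at_20)
print(cut_at_21)
```

This only avoids half-characters; the word is still shortened whenever it needs more bytes than the limit allows.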

BtPig19841 commented 5 years ago

bow_char = 258  # beginning-of-word marker
eow_char = 259  # end-of-word marker

code[0] = bow_char  # add the beginning-of-word id
for k, chr_id in enumerate(word_encoded, start=1):
    print(k, chr_id)
    code[k] = chr_id
code[len(word_encoded) + 1] = eow_char  # add the end-of-word id
print(code)

1 228
2 184
3 173
4 229
5 141
6 142
7 228
8 186
9 186
10 230
11 176
12 145
13 229
14 133
15 177
16 229
17 146
18 140
code: [258 228 184 173 229 141 142 228 186 186 230 176 145 229 133 177 229 146 140 259]

As shown above, this is what the code ends up producing. There seems to be a serious problem with how Chinese is handled.

guotong1988 commented 5 years ago

I keep words at 4 characters or fewer.
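A quick arithmetic check of why a 4-character limit avoids the truncation, assuming every character takes 3 bytes in UTF-8 (true for common Chinese characters; rarer ones outside the Basic Multilingual Plane take 4 bytes) and the default max_word_length of 20 mentioned earlier in this thread:

```python
max_word_length = 20
usable_bytes = max_word_length - 2            # two slots go to the begin/end markers
bytes_per_char = len('中'.encode('utf-8'))    # bytes used by one common Chinese character
max_chars = usable_bytes // bytes_per_char    # whole characters that fit

print(usable_bytes, bytes_per_char, max_chars)
```

So words of up to 6 such characters fit without being cut, and a 4-character word uses 12 of the 18 available byte slots.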

BtPig19841 commented 5 years ago

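So at lookup time, do the three bytes have to be combined to form one Chinese character? Another question: wouldn't it be better to build a separate lookup table for Chinese characters?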
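For comparison, a character-level lookup table of the kind suggested here could be sketched as follows. This is an assumption-laden illustration, not the repository's approach: the function names, the special tokens, and their ids are all made up, and the model's character embedding table would have to be resized to the new vocabulary size.

```python
def build_char_vocab(words, specials=('<pad>', '<unk>', '<bow>', '<eow>')):
    """Assign one id per distinct character (hypothetical helper)."""
    char_to_id = {tok: i for i, tok in enumerate(specials)}
    for word in words:
        for ch in word:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id)
    return char_to_id


def word_to_char_ids(word, char_to_id, max_word_length=20):
    """Map a word to a fixed-length id list: <bow>, characters, <eow>, then padding."""
    ids = [char_to_id['<pad>']] * max_word_length
    chars = word[:max_word_length - 2]  # slice by characters, not by bytes
    ids[0] = char_to_id['<bow>']
    for k, ch in enumerate(chars, start=1):
        ids[k] = char_to_id.get(ch, char_to_id['<unk>'])
    ids[len(chars) + 1] = char_to_id['<eow>']
    return ids


vocab = build_char_vocab(['中华人民共和国'])
ids = word_to_char_ids('中华人民共和国', vocab, max_word_length=10)
print(ids)
```

With one id per character, all seven characters fit even with max_word_length set to 10, whereas the byte-level scheme needs 21 slots plus the two markers.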

BtPig19841 commented 5 years ago

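The original bilm uses a CNN to extract features from English characters. Does Chinese still need a CNN over characters? It's rare to see a CNN used to extract features from Chinese characters! Some blog posts I've seen online skip the character convolution entirely. Does your code use the CNN to process Chinese characters?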

guotong1988 commented 5 years ago

I'm not sure.

edfall commented 5 years ago

The author never thought about the encoding at all... This is basically a copy of the original repo...

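Chinese clearly needs its own character dictionary and processing built on top of it. Otherwise, as things stand now, you might as well just use pinyin.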

guotong1988 commented 5 years ago

It still works this way.

RandomTuringDuck commented 5 years ago

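I took part in a competition before, and adding ELMo did improve the results. But the encoding size really should equal the number of Chinese characters in your vocabulary, otherwise the encoding may be problematic. As for using a CNN to extract features, you can add it or leave it out, since there is already an embedding for the words. If you do add it, the filter_size will probably be fairly small, mostly 2, 3, 4, 5, and basically no larger. Adding a CNN layer to extract features is still somewhat better than not adding one, since it provides a bit more information.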
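As a rough illustration of a character CNN with small filter widths, here is a NumPy sketch: one convolution per width over the character embeddings of a single word, followed by ReLU and max-pooling over positions. It is not the bilm implementation; the function name, the embedding size, the number of filters, and the random weights are all assumptions chosen for the example.

```python
import numpy as np


def char_cnn_features(char_embeddings, filters):
    """Convolve over character positions and max-pool (illustrative sketch).

    char_embeddings: array of shape [num_chars, emb_dim]
    filters: list of arrays, each of shape [width, emb_dim, num_filters]
    """
    outputs = []
    for w in filters:
        width = w.shape[0]
        num_windows = char_embeddings.shape[0] - width + 1
        # one row per window position, one column per filter
        conv = np.stack([
            np.tensordot(char_embeddings[i:i + width], w, axes=([0, 1], [0, 1]))
            for i in range(num_windows)
        ])
        # ReLU, then keep the strongest response of each filter
        outputs.append(np.maximum(conv, 0).max(axis=0))
    return np.concatenate(outputs)


rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))  # 6 characters, 8-dim embeddings (made-up sizes)
filters = [rng.normal(size=(width, 8, 4)) for width in (2, 3, 4, 5)]
features = char_cnn_features(emb, filters)
print(features.shape)
```

Note that a filter of width 5 needs at least 5 characters in the word, which is one practical reason to keep widths small for short Chinese words, or to pad short words first.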

rokid / ELMo-chinese

Why are Chinese characters also encoded into 255 ids? #3