Closed xmxoxo closed 3 years ago
The vocab was trained with sentencepiece; if the training procedure is correct, the result is identical every run. I have repeated the experiment to confirm. Did you run into a problem when using it?
It does seem to be a problem. I retrained the tokenizer model myself, then ran prediction with the model you provided, and the output is completely wrong. Here is an example:
Please enter a sentence (Q to quit): The near-term policy remedies are clear: raise the minimum wage to a level that will keep a fully employed worker and his or her family out of poverty, and extend the earned-income tax credit to childless workers.
。容易政策方案有的反应常注利率的以色列人疫苗将在尼迪治理错误的话家和极其国长免受题的环境治理获得良好的美国对欢迎在
The command used to train the tokenizer model was:
python tokenize.py
Here are the first 20 lines of the trained vocab file, for comparison: eng.vocab
<pad> 0
<unk> 0
<s> 0
</s> 0
▁t -0
in -1
▁a -2
he -3
re -4
on -5
▁the -6
er -7
at -8
en -9
▁s -10
▁c -11
▁o -12
it -13
an -14
es -15
chn.vocab:
<pad> 0
<unk> 0
<s> 0
</s> 0
—— -0
经济 -1
国家 -2
美国 -3
▁但 -4
一个 -5
20 -6
我们 -7
政府 -8
中国 -9
可能 -10
他们 -11
欧洲 -12
问题 -13
这一 -14
世界 -15
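For reference, the listings above follow the sentencepiece `.vocab` format: each line is a tab-separated token/score pair, and the token's id is simply its line position. A minimal stdlib sketch of reading such a file (the `load_vocab` helper and the inline sample are illustrative, not part of the repo):

```python
# Minimal sketch: parse sentencepiece .vocab lines, where each line is
# "<token>\t<score>" and the line's position is the token id.
def load_vocab(lines):
    vocab = {}
    for token_id, line in enumerate(lines):
        token, score = line.rstrip("\n").split("\t")
        vocab[token] = (token_id, float(score))
    return vocab

# A tiny inline sample standing in for the real eng.vocab file.
sample = ["<pad>\t0", "<unk>\t0", "<s>\t0", "</s>\t0", "\u2581the\t-6"]
print(load_vocab(sample)["\u2581the"])  # (4, -6.0)
```

Because the id is positional, any difference in line order between two vocab files silently remaps every token id after the first divergence.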
sentencepiece==0.1.85
Calling the test() method in tokenize.py produces the following output:
---------- python36 ----------
['▁美国总统', '特朗普', '今日', '抵达', '夏威夷', '。']
[13663, 277, 7391, 7284, 18335, 28722]
惨败特朗普利用其纽黑文班加西在
Output finished (elapsed: 2 s)
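The garbage above is exactly what a vocab mismatch looks like: the ids are produced against one vocab, but the model decodes them against another, so every id lands on an unrelated token. A stdlib-only sketch of the failure mode (both vocab lists below are made-up fragments, loosely echoing the tokens in this thread):

```python
# Sketch of the mismatch: ids produced with one vocab, decoded with another.
vocab_a = ["▁美国总统", "特朗普", "今日", "抵达", "夏威夷", "。"]  # tokenizer's vocab
vocab_b = ["惨败", "利用", "其", "纽黑文", "班加西", "在"]          # model's vocab

ids = [vocab_a.index(t) for t in ["▁美国总统", "特朗普", "今日"]]  # [0, 1, 2]
print("".join(vocab_b[i] for i in ids))  # decodes into unrelated tokens
```

The decoded string is fluent-looking nonsense, which matches the symptom reported above.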
By the way, I hit this error during corpus preprocessing:
11:19:20.37|F:>python get_corpus.py
Traceback (most recent call last):
  File "get_corpus.py", line 17, in <module>
    fch.writelines(ch_lines)
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 3: illegal multibyte sequence
So I changed the code in get_corpus.py to:
with open(ch_path, "w", encoding='utf8') as fch:
    fch.writelines(ch_lines)
with open(en_path, "w", encoding='utf8') as fen:
    fen.writelines(en_lines)
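The likely root cause: on a Chinese-locale Windows machine, `open()` without an `encoding` argument falls back to the locale encoding (gbk), which cannot represent U+2022 (the bullet character named in the traceback). A quick check:

```python
# U+2022 (BULLET) is not representable in gbk, but encodes fine as utf-8.
bullet = "\u2022"
try:
    bullet.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print(gbk_ok)                  # False
print(bullet.encode("utf-8"))  # b'\xe2\x80\xa2'
```

Passing `encoding='utf8'` explicitly, as in the fix above, makes the script independent of the OS locale.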
I tried it again myself and it worked fine. The vocab has been updated at the download link; feel free to download it and check~
I also noticed that your sentencepiece version (0.1.85) does not match the 0.1.94 I use; try switching to 0.1.94.
First 20 entries of the trained English vocab:
<pad> 0
<unk> 0
<s> 0
</s> 0
▁t -0
in -1
▁a -2
he -3
re -4
on -5
▁the -6
er -7
at -8
en -9
▁s -10
▁c -11
▁o -12
it -13
an -14
es -15
First 20 entries of the Chinese vocab:
<pad> 0
<unk> 0
<s> 0
</s> 0
—— -0
经济 -1
国家 -2
美国 -3
▁但 -4
一个 -5
20 -6
我们 -7
政府 -8
中国 -9
可能 -10
他们 -11
欧洲 -12
问题 -13
▁这 -14
世界 -15
The Chinese vocab entry at score -14 is indeed different from your result (▁这 here vs. 这一 in yours).
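To locate such divergences systematically, the two .vocab listings can be compared line by line; a small stdlib sketch (the `first_diff` helper and the inlined file contents are illustrative):

```python
# Report the first line index where two vocab listings diverge.
def first_diff(lines_a, lines_b):
    for i, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return i, a, b
    return None  # no divergence in the compared range

mine  = ["<pad>\t0", "<unk>\t0", "▁这\t-14"]
yours = ["<pad>\t0", "<unk>\t0", "这一\t-14"]
print(first_diff(mine, yours))  # (2, '▁这\t-14', '这一\t-14')
```

Since token ids are positional, everything from the first divergent line onward is remapped, which is why a single mismatch is enough to break prediction.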
I'm closing this issue now; if you have further questions, feel free to continue the discussion :) ~
Only the model download was provided, not the vocab; in that case the predictions will still be wrong, because a self-trained vocab/tokenizer model is different.