suparek opened this issue 7 years ago
I hope you have made some progress on this topic; I am also interested.
As far as I know, the core idea of the mLSTM is that the model is trained on a UTF-8 encoded byte sequence.
If you look at the code in utils.py, specifically the preprocess() function:
import html

def preprocess(text, front_pad='\n ', end_pad=' '):
    text = html.unescape(text)               # decode HTML entities
    text = text.replace('\n', ' ').strip()   # flatten newlines into spaces
    text = front_pad + text + end_pad        # add the padding used during training
    text = text.encode()                     # str -> UTF-8 bytes
    return text
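Just to illustrate (a quick sketch, assuming the preprocess() and import html shown above): in Python 3, str.encode() defaults to UTF-8, so a Chinese string already comes out as a plain byte sequence with no extra work.

sample = u'年集中发力的领域'
byte_seq = preprocess(sample)
print(byte_seq)       # b'\n \xe5\xb9\xb4...' -- front pad, 24 UTF-8 bytes, end pad
print(len(byte_seq))  # 27 = 2 front-pad bytes + 8 characters * 3 bytes + 1 end-pad byte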
So, if you figure out how to convert the Chinese characters into UTF-8 bytes, you should be able to feed the sequence into the model for training. For example:
>>> ttt = u'年集中发力的领域'
>>> ttt
'年集中发力的领域'
>>> type(ttt)
<class 'str'>
>>> encoded_ttt = ttt.encode("utf-8")
>>> encoded_ttt
b'\xe5\xb9\xb4\xe9\x9b\x86\xe4\xb8\xad\xe5\x8f\x91\xe5\x8a\x9b\xe7\x9a\x84\xe9\xa2\x86\xe5\x9f\x9f'
>>> for word in encoded_ttt.decode("utf-8"):
...     print(word)
...
年
集
中
发
力
的
领
域
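In case it is useful, here is a minimal sketch (my own illustration, not code from this repo) of how that byte string maps onto the integer inputs a byte-level model consumes; I am assuming the model embeds raw byte values in the range 0-255:

import numpy as np

text = u'年集中发力的领域'
byte_seq = text.encode('utf-8')   # 24 raw bytes for the 8 characters

# Each byte is one timestep; every value falls in 0-255, so a 256-way
# embedding/softmax covers any UTF-8 input, Chinese included.
ids = np.frombuffer(byte_seq, dtype=np.uint8)
print(ids.shape)   # (24,)
print(ids[:3])     # [229 185 180] -> the three UTF-8 bytes of '年'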
Hi @gitathrun. Are you using Python 2 or 3?
@jonnykira Python 3.5
Cool, that should work then. For Python 2 you would also have to convert the UTF-8 string to a bytearray object within preprocess(), roughly as sketched below.
Out of curiosity, have you successfully trained a model on Chinese data?
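Something like this, untested and simplified (the preprocess_py2 name and the dropped HTML unescaping are my own shortcuts, not the repo's code):

# Python 2 sketch: encode the unicode string to UTF-8 first, then wrap it in a
# bytearray so downstream code sees integer byte values just as in Python 3.
def preprocess_py2(text, front_pad=u'\n ', end_pad=u' '):
    text = text.replace(u'\n', u' ').strip()
    text = front_pad + text + end_pad
    return bytearray(text.encode('utf-8'))

seq = preprocess_py2(u'年集中发力的领域')
print(len(seq))   # 27 byte values, each in the range 0-255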
@jonnykira First of all, thanks for your mLSTM code; it is very sleek, well-formed TensorFlow code. No, I have not trained one yet, but I plan to train a Chinese-based model sometime in the future.
Looking forward to a reply.