openai / generating-reviews-discovering-sentiment

Code for "Learning to Generate Reviews and Discovering Sentiment"
https://arxiv.org/abs/1704.01444
MIT License
1.51k stars, 379 forks

Can it work on Chinese? How can I train it on my own Chinese text dataset? #40

Open suparek opened 6 years ago

suparek commented 6 years ago

Hoping for a reply.

alaakh42 commented 6 years ago

check issue #30

gitathrun commented 6 years ago

I hope you have made some progress on this; I am also interested.

As far as I know, the key point for the mLSTM is that the model is trained on UTF-8 encoded byte sequences.

If you look at the preprocess() function in utils.py:

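import html

def preprocess(text, front_pad='\n ', end_pad=' '):
    text = html.unescape(text)              # turn HTML entities such as &amp; back into characters
    text = text.replace('\n', ' ').strip()  # flatten newlines, trim surrounding whitespace
    text = front_pad+text+end_pad           # add the start and end padding
    text = text.encode()                    # str -> bytes (UTF-8 is the default in Python 3)
    return text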

So, if you encode the Chinese characters as UTF-8, you should be able to feed the resulting byte sequence into the model for training. For example, in Python 3:

ttt = u'年集中发力的领域'

ttt
Out[50]: '年集中发力的领域'

type(ttt)
Out[51]: str

encoded_ttt = ttt.encode("utf-8")
encoded_ttt
Out[53]: b'\xe5\xb9\xb4\xe9\x9b\x86\xe4\xb8\xad\xe5\x8f\x91\xe5\x8a\x9b\xe7\x9a\x84\xe9\xa2\x86\xe5\x9f\x9f'
for word in encoded_ttt.decode("utf-8"):
    print(word)
年
集
中
发
力
的
领
域
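
Note that the loop above iterates over the decoded characters, whereas encode() produces bytes. A minimal sketch of what the byte-level input would look like, assuming the model consumes one integer per byte (the variable names encoded and byte_ids are only for illustration, they are not from the repository):

```python
import html

def preprocess(text, front_pad='\n ', end_pad=' '):
    # Same steps as the preprocess() quoted from utils.py above
    text = html.unescape(text)
    text = text.replace('\n', ' ').strip()
    text = front_pad + text + end_pad
    return text.encode()  # UTF-8 is the default encoding in Python 3

encoded = preprocess('年集中发力的领域')

# Iterating over a bytes object in Python 3 yields ints in the range 0..255.
byte_ids = list(encoded)

# Each of the 8 Chinese characters here takes 3 bytes in UTF-8,
# plus 2 bytes of front padding and 1 byte of end padding.
print(len(byte_ids))

# The byte sequence decodes back to the padded text, so nothing is lost.
print(bytes(byte_ids).decode('utf-8'))
```

So a Chinese sentence becomes roughly three times as many model steps as it has characters.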

jonny-d commented 6 years ago

Hi @gitathrun. Are you using Python 2 or 3?

gitathrun commented 6 years ago

@jonnykira Python 3.5

jonny-d commented 6 years ago

Cool, that should work then. For Python 2 you would also have to convert the UTF-8 string to a bytearray object within preprocess(), because iterating over a Python 2 byte string yields one-character strings rather than integers.
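
A rough sketch of a conversion that should behave the same under both versions, assuming the input is a unicode string (to_byte_ids is a hypothetical helper, not a function in the repository):

```python
def to_byte_ids(text):
    """Return the UTF-8 bytes of `text` as a list of ints (0..255)."""
    data = text.encode('utf-8')
    # bytearray yields ints when iterated in both Python 2 and Python 3,
    # whereas a plain Python 2 byte string yields one-character strings.
    return list(bytearray(data))

ids = to_byte_ids(u'年')
print(ids)  # the three UTF-8 bytes of the character: [229, 185, 180]
```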

Out of curiosity have you successfully trained a model on Chinese data?

gitathrun commented 6 years ago

@jonnykira First, thanks for your mLSTM code; it is very sleek, well-structured TensorFlow code. No, I have not trained one yet, but I plan to train a Chinese model sometime in the future.