hankcs / multi-criteria-cws

Simple Solution for Multi-Criteria Chinese Word Segmentation
http://www.hankcs.com/nlp/segment/multi-criteria-cws.html
GNU General Public License v3.0

How exactly is the bigram feature et in the paper computed? #4

Closed zxgineng closed 6 years ago

zxgineng commented 6 years ago

Hello, thanks for sharing. In the paper, "where ft = [ht; et] is the concatenation of BiLSTM hidden state and bigram feature embedding et" — how exactly is et computed?

Also, if "word embedding" here refers to embeddings of characters, what does "character embedding" correspond to?

hankcs commented 6 years ago

Please refer to the code:

        if options.bigram:
            for rep, word in zip(lstm_out, sentence):
                # word[0]/word[1] are the ids of the left and right bigrams
                # around the current character; look up their embeddings
                bi1 = dy.lookup(self.bigram_lookup, word[0], update=self.we_update)
                bi2 = dy.lookup(self.bigram_lookup, word[1], update=self.we_update)
                if self.dropout is not None:
                    bi1 = dy.dropout(bi1, self.dropout)
                    bi2 = dy.dropout(bi2, self.dropout)
                score_t = O * dy.tanh(H * dy.concatenate(
                    [bi1,
                     rep,
                     bi2]) + Hb) + Ob
                scores.append(score_t)

The units smaller than characters are radicals, but this paper does not use them. See https://github.com/hankcs/sub-character-cws . The two projects share essentially the same codebase, which is why in the code `word` means character and `char` means radical.
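The scoring line in the snippet above can be read as two affine layers with a tanh in between. Here is a minimal NumPy sketch of that expression; all dimensions below are illustrative assumptions, not values taken from the repo.

```python
import numpy as np

# Sketch of: score_t = O * tanh(H * concat([bi1, rep, bi2]) + Hb) + Ob
# bi1/bi2 are the bigram embeddings, rep is the BiLSTM hidden state h_t.
d_bi, d_h, d_hid, n_tags = 50, 100, 150, 4

rng = np.random.default_rng(0)
bi1 = rng.standard_normal(d_bi)   # left-bigram embedding
bi2 = rng.standard_normal(d_bi)   # right-bigram embedding
rep = rng.standard_normal(d_h)    # BiLSTM hidden state for this character

# H, Hb, O, Ob play the role of the trained parameters (Wx + b, twice).
H = rng.standard_normal((d_hid, d_bi + d_h + d_bi))
Hb = rng.standard_normal(d_hid)
O = rng.standard_normal((n_tags, d_hid))
Ob = rng.standard_normal(n_tags)

f_t = np.concatenate([bi1, rep, bi2])      # f_t = [bi1; h_t; bi2]
score_t = O @ np.tanh(H @ f_t + Hb) + Ob   # one score per tag
print(score_t.shape)
```

The result is a vector of one score per segmentation tag for the current character.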

zxgineng commented 6 years ago

@hankcs Thanks a lot. DyNet is really hard for me to read. If `rep` is the LSTM output, what do `word[0]` and `word[1]` refer to? And in `O * dy.tanh(H * dy.concatenate([bi1, rep, bi2]) + Hb) + Ob`, are H, Hb, O, Ob all trained parameters?

hankcs commented 6 years ago
  1. For a character sequence abc with current character b, `word[0]` is the bigram id of ab and `word[1]` is the bigram id of bc.
  2. Yes, all of them; each is a Wx + b layer.
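Point 1 above can be sketched in plain Python. The names (`PAD`, `bigram_ids`, the toy vocabulary) are illustrative assumptions, not taken from the repo; the point is only how each position is paired with its left and right bigram ids.

```python
PAD = "<pad>"  # hypothetical padding symbol for sentence boundaries

def bigram_ids(chars, bigram2id):
    """For each character, return ids of its left and right bigrams."""
    padded = [PAD] + list(chars) + [PAD]
    pairs = []
    for t in range(1, len(padded) - 1):
        left = padded[t - 1] + padded[t]   # e.g. "ab" when the current char is "b"
        right = padded[t] + padded[t + 1]  # e.g. "bc"
        pairs.append((bigram2id[left], bigram2id[right]))
    return pairs

# Toy bigram vocabulary covering the sequence "abc".
vocab = {}
for bg in ["<pad>a", "ab", "bc", "c<pad>"]:
    vocab[bg] = len(vocab)

print(bigram_ids("abc", vocab))
# For the middle character "b", the pair is (vocab["ab"], vocab["bc"]) == (1, 2)
```

Each `(word[0], word[1])` pair is then used to look up the two bigram embeddings that get concatenated around the BiLSTM state.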