taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License

Why do you have 6 dim outputs for word segmentation? #21

Closed wannaphong closed 4 years ago

wannaphong commented 4 years ago

From https://github.com/taishi-i/nagisa/blob/master/nagisa/model.py#L59: why do you have 6-dim outputs for word segmentation? encode_ws has 6-dim outputs. I understand you are using BMES (the first 4 dims). What are the last two dimensions used for? Could you explain that, please?

Thank you.

taishi-i commented 4 years ago

Hi @wannaphong, thank you for your question.

I use the BMES tags (4 dims) and the special tokens (2 dims) for word segmentation. The special tokens are Start of Sentence (SOS) and End of Sentence (EOS).

The CRF layer of the BiLSTM-CRF uses a transition matrix to model tag dependencies, e.g. that the M tag frequently occurs after the B tag and the E tag frequently occurs after the M tag. To implement this, the model needs the special tokens (SOS and EOS): they capture, for example, that the B tag frequently occurs right after the SOS token and that the EOS token frequently follows the E tag.
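As a minimal sketch (not nagisa's actual code), here is how the 6 tags and a transition matrix interact. The tag names, scores, and `transition_score` helper are hypothetical; the point is only that SOS and EOS need rows/columns in the matrix, which is why the output dimension is 6 rather than 4:

```python
# Hypothetical illustration of a 6-tag CRF transition matrix:
# 4 BMES tags plus SOS/EOS special tokens.
import numpy as np

TAGS = ["B", "M", "E", "S", "SOS", "EOS"]  # 6 dims
IDX = {t: i for i, t in enumerate(TAGS)}

# trans[i, j] is the (made-up) score of moving from tag i to tag j.
# High scores mark plausible transitions (e.g. SOS -> B, B -> M,
# E -> EOS); everything else gets a low default score.
trans = np.full((6, 6), -2.0)
for prev, nxt in [("SOS", "B"), ("SOS", "S"), ("B", "M"), ("B", "E"),
                  ("M", "M"), ("M", "E"), ("E", "B"), ("E", "S"),
                  ("S", "B"), ("S", "S"), ("E", "EOS"), ("S", "EOS")]:
    trans[IDX[prev], IDX[nxt]] = 1.0

def transition_score(tag_seq):
    """Sum transition scores over the path SOS -> tag_seq -> EOS."""
    path = ["SOS"] + tag_seq + ["EOS"]
    return sum(trans[IDX[a], IDX[b]] for a, b in zip(path, path[1:]))

# A well-formed BMES path scores higher than an ill-formed one
# (e.g. a sentence starting with M or ending mid-word).
good = transition_score(["B", "M", "E", "S"])
bad = transition_score(["M", "B", "E", "M"])
print(good, bad)
```

In a real BiLSTM-CRF the transition scores are learned parameters, but the bookkeeping is the same: because the paths scored by the CRF begin at SOS and end at EOS, those two tokens must be part of the tag set.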

I referred to the BiLSTM-CRF model architecture (https://www.aclweb.org/anthology/N16-1030/) and its original implementation: https://github.com/glample/tagger/blob/1c9618889fb89500cc5e70c45c27859b89d44449/model.py#L285.

Thank you.

wannaphong commented 4 years ago

Thank you. 👍