taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License

Details about pre-trained nagisa model #19

Closed: Isa-rentacs closed this issue 4 years ago

Isa-rentacs commented 4 years ago

I have some questions about the hyperparameters and the corpus used to train the built-in model.

When I run the code below:

import nagisa
tagger = nagisa.Tagger()
print(tagger._hp)

I get:

{
    'LAYERS': 1,
    'THRESHOLD': 2,
    'DECAY': 3,
    'EPOCH': 50,
    'WINDOW_SIZE': 3,
    'DIM_UNI': 32,
    'DIM_BI': 16,
    'DIM_WORD': 16,
    'DIM_CTYPE': 8,
    'DIM_TAGEMB': 16,
    'DIM_HIDDEN': 100,
    'LEARNING_RATE': 0.075,
    'DROPOUT_RATE': 0.2,
    'TRAINSET': '../../nlp2018/workshop/nagisa-train/data/bccwj.train',
    'TESTSET': '../../nlp2018/workshop/nagisa-train/data/bccwj.test',
    'DEVSET': '../../nlp2018/workshop/nagisa-train/data/bccwj.dev',
    'DICTIONARY': '../../nlp2018/workshop/nagisa-train/data/unidict.txt',
    'HYPERPARAMS': 'data/nagisa_v002.hp',
    'MODEL': 'data/nagisa_v002.model',
    'VOCAB': 'data/nagisa_v002.dict',
    'EPOCH_MODEL': 'data/epoch.model',
    'TMP_PRED': 'data/pred',
    'TMP_GOLD': 'data/gold',
    'VOCAB_SIZE_UNI': 3090,
    'VOCAB_SIZE_BI': 82114,
    'VOCAB_SIZE_WORD': 59260,
    'VOCAB_SIZE_POSTAG': 24
}

Here I have 3 questions:

  1. The prefix in these paths (nagisa_v002) differs from the files actually shipped with the package (nagisa_v001). Is this just a naming discrepancy?
  2. It says BCCWJ was used as the source data. I believe this is the BCCWJ at https://pj.ninjal.ac.jp/corpus_center/bccwj/, but I would like to confirm that this is the case.
  3. If the answer to the previous question is yes, could you share more details about the training data, such as the number of lines and the word unit (short/long)?
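To make question 1 concrete, the three model-file paths in `tagger._hp` all share one prefix, which is what differs from the shipped files. A quick pure-Python check (the dict values are copied from the output above):

```python
import os

# Model-file entries copied from the tagger._hp output above.
hp = {
    "HYPERPARAMS": "data/nagisa_v002.hp",
    "MODEL": "data/nagisa_v002.model",
    "VOCAB": "data/nagisa_v002.dict",
}

# Strip the directory and extension to recover the shared prefix.
prefixes = {os.path.splitext(os.path.basename(p))[0] for p in hp.values()}
print(prefixes)  # {'nagisa_v002'}
```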
taishi-i commented 4 years ago

Hi @Isa-rentacs, thank you for using nagisa.

  1. The prefix in these paths (nagisa_v002) differs from the files actually shipped with the package (nagisa_v001). Is this just a naming discrepancy?

Yes! This is just a matter of the filename. The files are the same.

  2. It says BCCWJ was used as the source data. I believe this is the BCCWJ at https://pj.ninjal.ac.jp/corpus_center/bccwj/, but I would like to confirm that this is the case.
  3. If the answer to the previous question is yes, could you share more details about the training data, such as the number of lines and the word unit (short/long)?

That's right. I used the core data of BCCWJ as the source data for training nagisa's model. I used ClassA-1.list to extract the evaluation data; the remaining files were used as the training and development data. The word unit of these datasets is the short unit.
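As a rough sketch of that split (the function name, identifiers, and dev fraction below are illustrative placeholders, not the actual preparation script): files listed in ClassA-1.list become the evaluation set, and everything else is divided between training and development data.

```python
def split_bccwj(all_ids, class_a1_ids, dev_fraction=0.05):
    """Sketch of the split described above: ClassA-1 files are held out
    for evaluation; the rest is divided into training and dev data."""
    test = [i for i in all_ids if i in class_a1_ids]
    rest = [i for i in all_ids if i not in class_a1_ids]
    n_dev = max(1, int(len(rest) * dev_fraction))
    return rest[:-n_dev], rest[-n_dev:], test  # train, dev, test

train, dev, test = split_bccwj(["a", "b", "c", "d"], {"b"})
print(train, dev, test)  # ['a', 'c'] ['d'] ['b']
```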

Please refer to the following link. http://www.ar.media.kyoto-u.ac.jp/mori/research/topics/PST/NextNLP.html

Thank you!

Isa-rentacs commented 4 years ago

Thank you for your quick reply, much appreciated. Closing this issue as I don't have any more questions. Thanks!