taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License
379 stars 22 forks

dict_file format #16

Closed KoichiYasuoka closed 5 years ago

KoichiYasuoka commented 5 years ago

I'm trying to train a model with nagisa.fit on UD_Classical_Chinese-Kyoto (漢文) using a 4-level classified POS system (4階層品詞). See my blog for what I tried. I found the dict_file parameter of nagisa.fit and guessed that it is an external dictionary (外部辞書), but I could not find any explanation or usage of dict_file in your documentation. How do I use dict_file? Do it and train_file support a classified POS system?

taishi-i commented 5 years ago

Thank you for using nagisa. As you guessed, the dict_file parameter specifies an external dictionary. I'm sorry the documentation doesn't explain it in detail.

Please refer to sample.dict in sample_datasets. Each line of the dict_file consists of a word and its POS tag separated by a tab (word\tpostag), and the dictionary helps with classifying POS tags.
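As a minimal sketch of that format, the following Python writes and re-reads a tab-delimited dictionary with one word\tpostag entry per line. The helper names (write_dict_file, read_dict_file), the file name, and the sample entries with their tags are all hypothetical and only illustrate the layout; sample.dict in sample_datasets remains the reference.

```python
import os
import tempfile


def write_dict_file(path, entries):
    """Write (word, postag) pairs as one tab-delimited entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, postag in entries:
            if "\t" in word or "\t" in postag:
                raise ValueError(f"tab inside field: {word!r}, {postag!r}")
            f.write(f"{word}\t{postag}\n")


def read_dict_file(path):
    """Read the file back, rejecting lines that are not word<TAB>postag."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) != 2 or not all(fields):
                raise ValueError(
                    f"line {lineno}: expected word<TAB>postag, got {line!r}"
                )
            entries.append((fields[0], fields[1]))
    return entries


# Hypothetical entries; the tags are illustrative only.
sample_entries = [("子", "NOUN"), ("曰", "VERB"), ("學", "VERB")]

with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "my-dict.txt")  # hypothetical file name
    write_dict_file(dict_path, sample_entries)
    loaded = read_dict_file(dict_path)

print(loaded == sample_entries)
```

Validating the file before training makes a stray space-delimited or three-column line fail early with a line number, rather than surfacing later during training.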

taishi-i commented 5 years ago

I read your blog and ran a program to train on UD_Classical_Chinese-Kyoto. By tuning the hyperparameters, I was able to obtain an even better POS-tagging F1 score on the test set. If you are interested, please run the program below.

nagisa.fit(
    train_file="lzh_kyoto-ud-train.txt",
    dev_file="lzh_kyoto-ud-dev.txt",
    test_file="lzh_kyoto-ud-test.txt",
    dict_file="lzh_udkanbun-dict.txt",
    model_name="lzh_kyoto-nagisa",
    dim_tagemb=32,
    decay=3,
)

Epoch LR Loss Time_m DevWS_f1 DevPOS_f1 TestWS_f1 TestPOS_f1
1 0.100 4.976 0.280 97.69 84.81 98.42 87.09
2 0.100 2.277 0.254 97.62 87.16 98.42 87.09
3 0.100 1.895 0.258 97.18 87.06 98.42 87.09
4 0.050 1.664 0.256 97.12 87.70 98.42 87.09
5 0.050 1.344 0.289 97.72 88.57 98.22 90.33
6 0.050 1.228 0.262 97.46 89.18 98.22 90.33
7 0.050 1.142 0.253 97.48 88.95 98.22 90.33
8 0.025 1.091 0.251 97.37 89.48 98.22 90.33
9 0.025 0.957 0.251 97.55 89.31 98.22 90.33
10 0.025 0.922 0.256 97.52 89.61 98.22 90.33
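
To compare epochs without reading the columns by eye, the log can be parsed with a few lines of Python. This is only a sketch: it assumes the log is whitespace-delimited with the header on its own line, and parse_log is a hypothetical helper, not part of nagisa. It reports the epoch with the best score in each dev column and the epochs at which the test columns changed.

```python
LOG = """\
Epoch LR Loss Time_m DevWS_f1 DevPOS_f1 TestWS_f1 TestPOS_f1
1 0.100 4.976 0.280 97.69 84.81 98.42 87.09
2 0.100 2.277 0.254 97.62 87.16 98.42 87.09
3 0.100 1.895 0.258 97.18 87.06 98.42 87.09
4 0.050 1.664 0.256 97.12 87.70 98.42 87.09
5 0.050 1.344 0.289 97.72 88.57 98.22 90.33
6 0.050 1.228 0.262 97.46 89.18 98.22 90.33
7 0.050 1.142 0.253 97.48 88.95 98.22 90.33
8 0.025 1.091 0.251 97.37 89.48 98.22 90.33
9 0.025 0.957 0.251 97.55 89.31 98.22 90.33
10 0.025 0.922 0.256 97.52 89.61 98.22 90.33
"""


def parse_log(text):
    """Turn the whitespace-delimited log into a list of dicts, one per epoch."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    records = []
    for row in rows:
        record = {}
        for name, value in zip(header, row):
            record[name] = int(value) if name == "Epoch" else float(value)
        records.append(record)
    return records


records = parse_log(LOG)

# Epochs with the highest dev scores in each column.
best_dev_ws = max(records, key=lambda r: r["DevWS_f1"])
best_dev_pos = max(records, key=lambda r: r["DevPOS_f1"])

# Epochs at which the reported test scores differ from the previous epoch.
test_changed = [
    cur["Epoch"]
    for prev, cur in zip([None] + records, records)
    if prev is None
    or (prev["TestWS_f1"], prev["TestPOS_f1"])
    != (cur["TestWS_f1"], cur["TestPOS_f1"])
]

print("best DevWS_f1 :", best_dev_ws["Epoch"], best_dev_ws["DevWS_f1"])
print("best DevPOS_f1:", best_dev_pos["Epoch"], best_dev_pos["DevPOS_f1"])
print("test scores changed at epochs:", test_changed)
```

In this log the test columns change only at epochs 1 and 5, and epoch 5 is also where DevWS_f1 first exceeds its epoch 1 value. That is an observation about these ten rows only, not a statement about how nagisa decides when to re-evaluate.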