megagonlabs / ginza

A Japanese NLP library using spaCy as a framework, based on Universal Dependencies
MIT License

ja_ginza model: how trained? #73

Closed: kissge closed this issue 4 years ago

kissge commented 4 years ago

I'd like to ask about the ja_ginza model provided by this repo. It currently contains a pretrained NER model, but I couldn't find any documentation on how, or on what corpora, it was trained. Where can I find this information?

Thanks.

hiroshi-matsuda-rit commented 4 years ago

@kissge Sorry for the late reply. I've been changing the NE training dataset from KWDLC to GSK2014-A with BCCWJ.

Until GiNZA v2.2.1, we used spaCy's Language.update() API on the NE spans of KWDLC to train the ner model independently. (The parser model is trained before the ner model.)
https://spacy.io/api/language#update
https://github.com/megagonlabs/ginza/blob/6a667efca0edc7c628402c53f5c61742c0739ed0/shell/train_ner.sh
https://github.com/megagonlabs/ginza/blob/6a667efca0edc7c628402c53f5c61742c0739ed0/ginza_util/train_ner.py
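For readers unfamiliar with that API, here is a minimal sketch of the generic spaCy v2 `Language.update()` training loop, not GiNZA's actual train_ner.py. The sample sentence, entity offsets, labels (`PERSON`, `LOCATION`), and epoch count are all illustrative, and a blank `ja` pipeline assumes a Japanese tokenizer backend is installed.

```python
import random
import spacy

# Hypothetical miniature training set: (text, {"entities": [(start, end, label)]}).
# Offsets are character offsets and must align with token boundaries produced
# by the tokenizer; these labels are placeholders, not the KWDLC label set.
TRAIN_DATA = [
    ("松田さんは東京に住んでいます",
     {"entities": [(0, 2, "PERSON"), (5, 7, "LOCATION")]}),
]

nlp = spacy.blank("ja")            # blank Japanese pipeline (spaCy v2)
ner = nlp.create_pipe("ner")       # train only an NER component
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # In spaCy v2, Language.update() accepts raw text plus a gold
        # annotation dict and performs one optimization step.
        nlp.update([text], [annotations], drop=0.5, sgd=optimizer, losses=losses)
    print(epoch, losses)
```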

Starting with the next release of GiNZA, we'll use the spacy train command with JSON-formatted gold data for NER training. We're using UD_Japanese-BCCWJ aligned with GSK2014-A. Please see this branch if you want to understand the process:
https://github.com/megagonlabs/ginza/blob/change_ner_corpus_to_gsk2014a/ginza_util/gsk2014a.py
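To make the workflow concrete, below is a sketch of spaCy v2's generic JSON training format with BILUO NER tags, not the actual output of the GSK2014-A/BCCWJ conversion script. The sentence, POS tags, dependency labels, NER label (`U-City`), and file paths are all illustrative assumptions.

```python
import json

# Sketch of the spaCy v2 JSON training format. Each token carries a BILUO
# NER tag ("B-", "I-", "L-", "U-" plus label, or "O" for non-entities).
train_data = [{
    "id": 0,
    "paragraphs": [{
        "raw": "東京に行く",
        "sentences": [{
            "tokens": [
                # "head" is the offset of the head token relative to this token.
                {"id": 0, "orth": "東京", "tag": "名詞-固有名詞-地名-一般",
                 "head": 2, "dep": "obl", "ner": "U-City"},
                {"id": 1, "orth": "に", "tag": "助詞-格助詞",
                 "head": -1, "dep": "case", "ner": "O"},
                {"id": 2, "orth": "行く", "tag": "動詞-非自立可能",
                 "head": 0, "dep": "ROOT", "ner": "O"},
            ]
        }]
    }]
}]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(train_data, f, ensure_ascii=False)

# Then train with the spaCy v2 CLI (output dir and dev file are placeholders):
#   python -m spacy train ja ./model ./train.json ./dev.json --pipeline ner
```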