stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Can I use custom dictionary in stanfordnlp? #73

Closed lfzhagn closed 5 years ago

lfzhagn commented 5 years ago

Hi! I have some questions about using a custom dictionary in stanfordnlp.

  1. Can I use my own dictionary when I tokenize sentences with the pipeline?
  2. I know stanfordnlp provides a Python wrapper for the Java Stanford CoreNLP server, which can help me extract named entities. Can I add my custom dictionary when I use the CoreNLPClient? I know this can be done in Java with Stanford CoreNLP; can I achieve the same in Python with stanfordnlp?

By "custom dictionary" I mean something like this:

```
# my_dictionary.txt
上官婉儿 /nr
阿莫西林 /mhd
......
```
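
A dictionary in this format is just plain "word /tag" lines, so it is easy to inspect or preprocess from Python before handing it to any tool. A minimal sketch (`load_custom_dict` is an illustrative helper, not a stanfordnlp API; the tags are the ones from the example above):

```python
# Parse dictionary lines in the "word /tag" format shown above.
def load_custom_dict(lines):
    entries = {}
    for line in lines:
        line = line.strip()
        # Skip blanks and comment lines such as "# my_dictionary.txt".
        if not line or line.startswith("#"):
            continue
        word, _, tag = line.partition(" /")
        entries[word] = tag
    return entries

sample = ["# my_dictionary.txt", "上官婉儿 /nr", "阿莫西林 /mhd"]
print(load_custom_dict(sample))  # {'上官婉儿': 'nr', '阿莫西林': 'mhd'}
```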

Thanks a lot if you can give me some answers. :)

qipeng commented 5 years ago

For the neural Pipeline, unfortunately this is not an option yet.

For the CoreNLP client, if you are able to specify this dictionary file through CoreNLP properties files, you should be able to do the same with the client.

lfzhagn commented 5 years ago

@qipeng Now I use a Python dict to specify the properties for the CoreNLP client. My code looks like this:

```python
# code in Python
properties = {
    ...
    "tokenize.language": "zh",
    "segment.model": "edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
    "segment.sighanCorporaDict": "edu/stanford/nlp/models/segmenter/chinese",
    "segment.serDictionary": "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz",
    "segment.sighanPostProcessing": "true",
    ...
}

with CoreNLPClient(annotators=['ner'], timeout=90000, memory='16G', properties=properties) as client:
    annotated = client.annotate(text)
    ...
```

My custom dictionaries look like this:
chinese_medicine_name.txt chinese_person_name.txt

It seems that segment.serDictionary only allows a single value. Can I specify multiple dictionaries from Python code?
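
If segment.serDictionary really does accept only a single path, one workaround at the Python level is to merge the plain-text dictionaries into one file before building a serialized dictionary from them. A sketch (`merge_dicts` is a hypothetical helper, not part of stanfordnlp; the file names are the ones from this thread):

```python
# Merge several plain-text dictionary files into one, dropping
# duplicate entries, so a single file can be fed to downstream tools.
def merge_dicts(paths, out_path):
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = line.strip()
                    if entry and entry not in seen:
                        seen.add(entry)
                        out.write(entry + "\n")

# merge_dicts(["chinese_medicine_name.txt", "chinese_person_name.txt"],
#             "combined_dict.txt")
```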


You suggested modifying the properties files. Is my understanding correct: unpack stanford-chinese-corenlp-2018-10-05-models.jar, add my custom dictionary paths to StanfordCoreNLP-chinese.properties, and then re-pack the jar?

Very grateful for your reply :)

qipeng commented 5 years ago

@J38 probably knows more about setting custom dictionaries.

But to use your own, it should suffice to pack your dictionary files, under some identifiable path, into a jar file, add that jar to the classpath used to run the CoreNLP server, and load the dictionary files from those paths.

lfzhagn commented 5 years ago

Thanks a lot! Your explanation is quite clear :)

LeonSpark commented 4 years ago

This is my solution:

  1. Download stanford-segmenter-2018-10-16.zip from the official site (https://nlp.stanford.edu/software/segmenter.shtml).
  2. Unzip it to get the ChineseDictionary tool (edu.stanford.nlp.wordseg.ChineseDictionary).
  3. Create your custom dictionary files (e.g. places.txt), one place name per line, each entry no longer than 6 characters.
  4. Extend the existing dictionary data/dict-chris6.ser.gz with: java edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,places.txt -output dict-chris6.ser.gz
  5. Replace edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz in the models jar with the new file.
  6. Re-pack the jar: jar cvf stanford-chinese-corenlp-2018-10-05-models.jar *
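
Steps 5 and 6 above can also be done without unpacking the jar by hand: a .jar is a zip archive, so Python's zipfile can swap the dictionary in place. A sketch (`replace_in_jar` is a hypothetical helper; the file names follow the post, and every other entry, including the manifest, is copied through unchanged):

```python
import zipfile

def replace_in_jar(jar_path, member, new_file, out_path):
    """Copy jar_path to out_path, replacing `member` with the
    contents of `new_file` and keeping every other entry as-is."""
    with zipfile.ZipFile(jar_path) as src, \
         zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == member:
                dst.write(new_file, member)
            else:
                dst.writestr(info, src.read(info.filename))

# replace_in_jar("stanford-chinese-corenlp-2018-10-05-models.jar",
#                "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz",
#                "dict-chris6.ser.gz",
#                "stanford-chinese-corenlp-2018-10-05-models-patched.jar")
```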
lfzhagn commented 4 years ago

Thank you very much! That really helps a lot.👍
