jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License
354 stars, 38 forks

Replace tagger.pickle with a JSON file. #36

Closed (edong closed 1 year ago)

edong commented 1 year ago

Replace tagger.pickle with a JSON file, tagger.json.

Benefits:

The tagger.json file was generated from tagger.pickle as follows:

 >>> import json
 >>> import pickle
 >>> x = pickle.load(open("tagger.pickle", "rb"))
 >>> json.dump(
 ...     {
 ...         'weights': x[0],
 ...         'tagdict': x[1],
 ...         'classes': sorted(list(x[2])),
 ...     },
 ...     open('tagger.json', 'w', encoding='utf-8'),
 ...     ensure_ascii=False, indent=2, sort_keys=True)
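
Reading the file back is the mirror image of the conversion above. The following is a minimal sketch, not the code in this pull request: the function name `load_tagger_json` is hypothetical, and it assumes `classes` was originally a set (as the `sorted(list(x[2]))` call suggests), so the list is converted back after loading.

```python
import json


def load_tagger_json(path):
    """Hypothetical helper: read weights, tagdict, and classes from a JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # JSON has no set type, so the sorted list is turned back into a set here
    # (assumption: the tagger expects a set of class labels).
    return data["weights"], data["tagdict"], set(data["classes"])
```

Unlike `pickle.load`, `json.load` only ever builds plain dicts, lists, strings, and numbers, so any type that JSON lacks has to be restored explicitly like this.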

Running train_tagger.py at HEAD (commit 2e0fc06cee0e80fd8a89606016a1b66b848ac4c9) with no code changes already produces a different tagger.pickle, so this pull request only converts the existing tagger.pickle to tagger.json and does not generate a new model.

With these changes, running train_tagger.py produces a different model, but it is written in the same JSON format as the tagger.json in this pull request.
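
One way to confirm that two model files share a format while differing in content is to check their top-level structure. This is only a sketch: `has_expected_format` is a hypothetical name, the three keys come from the conversion snippet above, and the value types (dicts for `weights` and `tagdict`, a list for `classes`) are assumptions about what the tagger stores.

```python
import json

# Top-level keys taken from the conversion snippet in this pull request.
EXPECTED_KEYS = {"classes", "tagdict", "weights"}


def has_expected_format(path):
    """Hypothetical check: True if the file has exactly the expected keys,
    with a list for classes and dicts for tagdict and weights (assumed types)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return (
        isinstance(data, dict)
        and set(data) == EXPECTED_KEYS
        and isinstance(data["classes"], list)
        and isinstance(data["tagdict"], dict)
        and isinstance(data["weights"], dict)
    )
```

Because the check looks only at structure, a freshly trained model and the converted one would both pass even though their weights differ.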


edong commented 1 year ago

Thanks for the comments! I've reverted the documentation changes.

jacksonllee commented 1 year ago

Just added you to the acknowledgments section in the readme. Thanks again!