Replace tagger.pickle with a JSON file.

edong commented 1 year ago

Replace tagger.pickle with a JSON file, tagger.json.

Benefits:

JSON is plain text, so it is easier to inspect and to review in diffs. It also removes the security concern with unpickling: loading a pickle file can execute arbitrary code, whereas parsing JSON only produces plain data. See https://docs.python.org/3/library/pickle.html and https://www.benfrederickson.com/dont-pickle-your-data/
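
To make the security point concrete, here is a minimal sketch (not code from this repository). The `Payload` class is hypothetical and deliberately harmless: it only asks pickle to evaluate `1 + 1`, but the same hook could name any importable callable.

```python
import json
import pickle


class Payload:
    """Hypothetical object whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # pickle stores "call eval('1 + 1')" instead of any object state.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())

# Unpickling runs the stored call; the result is whatever eval returned.
unpickled = pickle.loads(blob)

# Parsing JSON, by contrast, only ever builds dicts, lists, strings,
# numbers, booleans and None.
parsed = json.loads('{"classes": ["N", "V"]}')

print(type(unpickled).__name__, type(parsed).__name__)
```

Nothing in the JSON text can trigger a function call while it is being parsed, which is the property the benefit above relies on.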

The tagger.json file was generated from tagger.pickle as follows:

 >>> import json
 >>> import pickle
 >>> # The pickle holds a 3-item sequence: weights, tagdict, classes.
 >>> with open("tagger.pickle", "rb") as f:
 ...     x = pickle.load(f)
 ...
 >>> with open("tagger.json", "w", encoding="utf-8") as f:
 ...     json.dump(
 ...         {
 ...             "weights": x[0],
 ...             "tagdict": x[1],
 ...             "classes": sorted(x[2]),  # sorted list for stable output
 ...         },
 ...         f, ensure_ascii=False, indent=2, sort_keys=True)
 ...
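
For the reverse direction, reading the file back could look roughly like the sketch below. The helper name `load_tagger` and the toy data are made up for illustration, and turning `classes` back into a set is an assumption based on the `sorted(list(...))` call in the conversion; the actual loading code in this pull request may differ.

```python
import json
import os
import tempfile


def load_tagger(path):
    """Hypothetical helper: read weights, tagdict and classes from JSON."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: classes was a set before it was sorted into a list.
    return data["weights"], data["tagdict"], set(data["classes"])


# Toy stand-in for the real model, using the same three top-level keys.
sample = {
    "weights": {"bias": {"N": 1.5}},
    "tagdict": {"我": "PRON"},
    "classes": ["N", "PRON"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tagger.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2, sort_keys=True)
    weights, tagdict, classes = load_tagger(path)
```

With `ensure_ascii=False` the Cantonese keys are stored as readable characters rather than `\uXXXX` escapes, so opening the file with `encoding="utf-8"` on both the write and the read side matters.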

Running train_tagger.py at HEAD (commit 2e0fc06cee0e80fd8a89606016a1b66b848ac4c9) with no code changes already produces a different tagger.pickle. For that reason, this pull request only converts the existing tagger.pickle to tagger.json and does not generate a new model.

Running train_tagger.py with these changes produces a different model, but in the same JSON format as the tagger.json included in this pull request.
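
One way to check "same format" without comparing model values is to look only at the top-level structure. The function below is a hypothetical sketch, not part of train_tagger.py, and it assumes the three keys written by the conversion snippet above.

```python
def has_tagger_format(data):
    """Hypothetical check: True if data has the layout of tagger.json."""
    return (
        isinstance(data, dict)
        and set(data) == {"weights", "tagdict", "classes"}
        and isinstance(data["weights"], dict)
        and isinstance(data["tagdict"], dict)
        and isinstance(data["classes"], list)
    )


# Two toy models with different values but the same layout.
converted = {"weights": {"bias": {"N": 1.0}}, "tagdict": {}, "classes": ["N"]}
retrained = {"weights": {"bias": {"N": 0.7}}, "tagdict": {}, "classes": ["N"]}

# A pickle-style tuple of (weights, tagdict, classes) does not match.
old_style = ({"bias": {"N": 1.0}}, {}, {"N"})

same_format = has_tagger_format(converted) and has_tagger_format(retrained)
```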

Thank you for submitting a pull request to improve this library! Please complete the following items (you may create the pull request first and then work through them by pushing additional commits to your branch):

[x] Add a concise title to this pull request on the GitHub web interface.
[x] Add a description in this box to describe what this pull request is about.
[ ] If code behavior is being updated (e.g., a bug fix), relevant tests should be added.
[ ] The CircleCI builds should pass, including both the code styling checks by black and flake8 as well as the test suite.
[x] Add an entry to CHANGELOG.md at the repository's root level.

edong commented 1 year ago

Thanks for the comments! I've reverted the documentation changes.

jacksonllee commented 1 year ago

Just added you to the acknowledgments section in the readme. Thanks again!

jacksonllee / pycantonese

Replace tagger.pickle with a JSON file. #36