Closed alvations closed 2 weeks ago
Merging this first so that testing can work in the nltk main repo
Thanks @alvations! I am not sure which PR to review, since the contents of this one seem already deprecated.
However, regarding the common structure of the perceptron taggers, instead of splitting each pickle into 3 json files, it could be advantageous to keep each tagger in one file. This would closely mimic the original data, and minimize the modifications needed in the handlers.
The following code converts each pickle to one json file, and verifies that the new package produces the same object as the original pickle:
```python
#!/usr/bin/env python
import json
from os.path import basename, splitext

from nltk.data import load

pickles = [
    'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle',
    'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle',
]

for file in pickles:
    print(f"Loading {file}")
    p = load(file)  # load the pickle
    (a, b, c) = p  # it is a triple
    c = sorted(c)  # c is a set, which json cannot serialize
    fjson = splitext(basename(file))[0] + ".json"
    with open(fjson, 'w') as f:
        json.dump((a, b, c), f)  # dump the serializable triple
    with open(fjson, 'r') as f:
        (a, b, c) = json.load(f)  # load the dump
    c = set(c)  # convert back to a set
    p2 = (a, b, c)
    print(f"{fjson}: {p == p2}")  # new data is identical to original pickle
```
```
Loading taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle
averaged_perceptron_tagger.json: True
Loading taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle
averaged_perceptron_tagger_ru.json: True
```
Yeah, the nltk_data PRs are hard to "review" before we see them tested in the NLTK main repo. The data has to be here for the tests on https://github.com/nltk/nltk/pull/3270 to work.
We can track and discuss the data on https://github.com/nltk/nltk/pull/3270, and in the nltk_data repo just monitor in case anyone sneaks in sleepy pickles or eval()-able code.
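For context on why pickles deserve that scrutiny: unpickling can execute arbitrary code, because `pickle` will call whatever callable an object's `__reduce__` returns. A minimal, harmless illustration (hypothetical payload, not anything in this PR):

```python
import pickle

class SleepyPickle:
    # __reduce__ tells pickle how to reconstruct the object; a malicious
    # payload can return any callable, e.g. os.system, instead of print
    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling!",))

payload = pickle.dumps(SleepyPickle())
result = pickle.loads(payload)  # calls print(...) as a side effect of loading
```

JSON has no such hook: `json.load` can only produce dicts, lists, strings, numbers, booleans, and null, which is the whole point of the migration.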
One pro of splitting them up is that it minimizes parsing of the files: each component is a plain `json.load(fin)`, and the files are easier for humans to read when split up.
The main con is that each tagger has to be loaded one json file at a time, 3x per tagger.
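As a rough sketch of that trade-off (the `.weights`/`.tagdict`/`.classes` file naming here is hypothetical, just to show the shape of each loader):

```python
import json

def load_split(prefix):
    """Split layout: one json.load per component, three reads per tagger."""
    parts = []
    for suffix in ("weights", "tagdict", "classes"):  # hypothetical naming
        with open(f"{prefix}.{suffix}.json") as fin:
            parts.append(json.load(fin))
    return tuple(parts)

def load_single(path):
    """Single-file layout: one read, mirroring the original pickle triple."""
    with open(path) as fin:
        weights, tagdict, classes = json.load(fin)
    return weights, tagdict, set(classes)  # restore classes to a set
```

The single-file loader stays closest to the pickle's `(weights, tagdict, classes)` structure, which is what keeps the handler changes small.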
I think for now, we can keep the structure of the data and get over the pickle issues first, then come back and refactor the data structure. Meanwhile, I'll create a github issue on this repo to keep track and get more feedback from @stevenbird and the rest of the community.
BTW, to test the serialization and loading of the jsons, we have to do it on the other main repo, with the unit tests and doctests. It's a little odd to keep them separated, but it actually works well for us since access to this repo is much more restricted than the main one =)
Note on the classes: the order of the class labels is tied to how the mappings work. The classes are saved as a set in the pickle, but since the objects and their hashes are saved together in the pickle, doing `list(set(classes))` is deterministic when loaded from pickle. Unfortunately, when doing it in json instead of pickle, the right order of the class labels is given by neither `sorted(classes)` nor `list(classes)` but by `repr(classes)`; otherwise the doctests don't reach the expected accuracy.
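To illustrate why set order is fragile here: string hashing is randomized per interpreter process (unless `PYTHONHASHSEED` is pinned), so iterating a set of label strings can yield a different order in another run. Whatever order is chosen for the labels, writing it out as an explicit JSON list pins it permanently; a minimal sketch with made-up labels:

```python
import json

# Hypothetical class labels, frozen in whatever order the model expects
ordered_classes = ["NN", "VB", "DT", "JJ"]

payload = json.dumps(ordered_classes)
restored = json.loads(payload)

# JSON preserves list order exactly, across processes and hash seeds,
# unlike iterating a set, whose order depends on string hashing
assert restored == ordered_classes
classes = set(restored)  # rebuild the set for fast membership checks
```

So the one-time job is recovering the order the model was actually trained with (the `repr(classes)` observation above); after that, the JSON list round-trips it reliably.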
This is added as an alternative format to the pickled Averaged Perceptron Tagger.

The `packages/taggers/averaged_perceptron_tagger_eng.zip` file contains:

The `packages/taggers/averaged_perceptron_tagger_rus.zip` file contains: