nltk / nltk_data


Added tagger in json format #208

Closed. alvations closed this 2 weeks ago

alvations commented 2 weeks ago

This is added as an alternative format to the pickled Averaged Perceptron Tagger.

The packages/taggers/averaged_perceptron_tagger_eng.zip file contains:

The packages/taggers/averaged_perceptron_tagger_rus.zip file contains:

alvations commented 2 weeks ago

Merging this first so that testing can work in the nltk main repo.

ekaf commented 2 weeks ago

Thanks @alvations! I am not sure which PR to review, since the contents of this one seem already deprecated.

However, regarding the common structure of the perceptron taggers, instead of splitting each pickle into 3 json files, it could be advantageous to keep each tagger in one file. This would closely mimic the original data, and minimize the modifications needed in the handlers.

The following code converts each pickle to one json file, and verifies that the new package produces the same object as the original pickle:

#!/usr/bin/env python
"""Convert each perceptron-tagger pickle to a single json file and verify
that the json file round-trips to the same object as the original pickle."""

import json
from os.path import basename, splitext

from nltk.data import load

pickles = [
    'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle',
    'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle',
    ]

for file in pickles:
    print(f"Loading {file}")
    p = load(file)      # load the pickle
    (a, b, c) = p       # it is a (weights, tagdict, classes) triple
    c = sorted(c)       # c is a set, which json cannot serialize
    fjson = splitext(basename(file))[0] + ".json"
    with open(fjson, 'w') as f:
        json.dump((a, b, c), f)     # dump the now-serializable triple
    with open(fjson, 'r') as f:
        (a, b, c) = json.load(f)    # load the dump
    c = set(c)                      # convert back to a set
    p2 = (a, b, c)
    print(f"{fjson}: {p == p2}")    # True => identical to the original pickle

Output:

Loading taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle
averaged_perceptron_tagger.json: True
Loading taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle
averaged_perceptron_tagger_ru.json: True
alvations commented 2 weeks ago

Yeah, the nltk_data PRs are hard to "review" before we see how they test out in the NLTK main repo. The data has to be here for the tests on https://github.com/nltk/nltk/pull/3270 to work.

We can track and communicate about the data on https://github.com/nltk/nltk/pull/3270, and then in the nltk_data repo just monitor in case anyone sneaks in sleepy pickles or eval()-able code.
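For context on why pickles get that extra scrutiny, here is a minimal illustration (the class name is made up) of why loading an untrusted pickle is riskier than loading JSON: unpickling can execute arbitrary code, while json.load only ever builds plain data.

import os
import pickle

class SleepyPayload:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # returning (callable, args) makes pickle.loads call that callable.
    def __reduce__(self):
        return (os.system, ("echo payload executed during unpickling",))

blob = pickle.dumps(SleepyPayload())
pickle.loads(blob)  # runs the shell command above; json.loads never executes code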

alvations commented 2 weeks ago

One pro of splitting them up is that it minimizes parsing of the files (each part is just a json.load(fin)), and the split files are easier for humans to read.

The main con is that each part has to be loaded from its own json file, i.e. three loads per tagger.
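To make the trade-off concrete, a rough sketch of the two loading patterns; the single-file name matches the conversion script above, while the per-part file names and the weights/tagdict/classes split are illustrative, not the final package layout.

import json

# Single-file layout: one load, then unpack the triple.
with open("averaged_perceptron_tagger.json") as fin:
    weights, tagdict, classes = json.load(fin)
classes = set(classes)  # classes were stored as a list, restore the set

# Split layout: one json.load per component, i.e. three loads per tagger.
parts = {}
for name in ("weights", "tagdict", "classes"):
    with open(f"averaged_perceptron_tagger.{name}.json") as fin:
        parts[name] = json.load(fin)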

I think for now we can keep the structure of the data and get past the pickle issues first, then come back and refactor the data structure. Meanwhile, I'll create a GitHub issue on this repo to keep track of it and get more feedback from @stevenbird and the rest of the community.


BTW, to test serializing and loading the jsons, we have to do it in the other (main) repo, with the unit tests and doctests. It's a little odd to keep them separated, but it actually works well for us because access to this repo is a lot tighter than the main one =)

A note on the classes: the order of the class labels is tied to how the mappings work. The classes are saved as a set in the pickle, and because the class labels and their hashes are stored together with the object, list(set(classes)) is deterministic when loaded from the pickle. Unfortunately, when serializing to json instead of pickle, the right order of the class labels is neither sorted(classes) nor list(classes) but the order given by repr(classes); otherwise the doctests don't reach the expected accuracy.
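A hedged sketch of the kind of end-to-end sanity check that catches a wrong class reconstruction: tag the same sentence with the original pretrained tagger and with one whose weights, tagdict and classes are re-attached (here simply copied, as a stand-in for the JSON-rebuilt objects), and require identical output. The attribute wiring below reflects the current PerceptronTagger internals and may differ in the final PR.

from nltk.tag.perceptron import PerceptronTagger

tokens = "The quick brown fox jumps over the lazy dog .".split()

original = PerceptronTagger()           # loads the pretrained model
expected = original.tag(tokens)

rebuilt = PerceptronTagger(load=False)  # empty tagger, no model loaded
rebuilt.model.weights = original.model.weights
rebuilt.tagdict = original.tagdict
rebuilt.classes = set(original.classes)     # stand-in for the JSON round trip
rebuilt.model.classes = rebuilt.classes

assert rebuilt.tag(tokens) == expected, "reconstruction changed the predictions"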