nschneid / pysupersensetagger

AMALGrAM, an English supersense tagger written in Python
GNU General Public License v3.0

--predict #2

Closed · nschneid closed this issue 10 years ago

nschneid commented 10 years ago

Create a --predict mode that, unlike --test-predict, does not require a (meaningless) third input column of gold tags and does not report the equally meaningless "accuracy" statistics.
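
A minimal sketch of how the two modes might sit side by side on the command line (only the --predict and --test-predict flag names come from this issue; the parser setup and column descriptions are assumptions, not the tagger's actual option set):

import argparse

# Sketch only: not the tagger's real option parser.
parser = argparse.ArgumentParser(description='AMALGrAM supersense tagger (sketch)')
mode = parser.add_mutually_exclusive_group()
mode.add_argument('--test-predict', metavar='FILE',
                  help='labeled input (third column = gold tags); reports accuracy')
mode.add_argument('--predict', metavar='FILE',
                  help='unlabeled input (no third column); prints predictions only')
args = parser.parse_args()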

nschneid commented 10 years ago

Also, avoid storing the entire test dataset in memory (which seems to be happening now).

Cf. args.disk, which applies to the training data (not yet implemented).
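
For the test/predict side, a streaming reader could look like the sketch below (the class name and keep_in_memory flag anticipate the SupersenseDataSet mentioned later in this thread; the tab-separated, blank-line-delimited file format is an assumption):

class SupersenseDataSet(object):
    """Sketch only. With keep_in_memory=False, sentences are yielded
    one at a time and never accumulated, so iterating over the test
    set uses constant memory; with keep_in_memory=True, the first
    pass caches everything for reuse (as training would need)."""

    def __init__(self, path, keep_in_memory=True):
        self.path = path
        self.keep_in_memory = keep_in_memory
        self._cache = None  # filled on the first pass if keep_in_memory

    def __iter__(self):
        if self._cache is not None:
            return iter(self._cache)
        return self._read()

    def _read(self):
        cache = [] if self.keep_in_memory else None
        with open(self.path) as f:
            sent = []
            for line in f:
                line = line.rstrip('\n')
                if line:
                    sent.append(tuple(line.split('\t')))
                elif sent:  # blank line ends a sentence
                    if cache is not None:
                        cache.append(sent)
                    yield sent
                    sent = []
            if sent:  # tolerate a missing trailing blank line
                if cache is not None:
                    cache.append(sent)
                yield sent
        if cache is not None:
            self._cache = cache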

nschneid commented 10 years ago

This will require some refactoring of decode(), which currently takes the dataset object as a parameter and yields the weights on each iteration.

One option:

from itertools import izip  # Python 2; in Python 3, plain zip() is already lazy

decoder = t.decode(testData, ...)  # lazy: one instance decoded per iteration step
for _, (sent, o0Feats) in izip(decoder, testData):
    # print the predictions for this sentence
    # then drop the instance so it can be garbage-collected
nschneid commented 10 years ago

Alternative to the above:

Split decode() and _viterbi() into:

  1. a generator, learn()—only used during training; iterates over the dataset multiple times, stores predictions in that dataset, and calls _perceptronUpdate(). Yields the model weights after each iteration and averages them after the last iteration. Called by train().
    • hang on...maybe all this functionality can be moved directly into train().
    • I guess the advantage of the decomposition is that the production of a new weight vector (by per-instance updates) is separated from the stuff that is done with those weights as a whole (saving them, deciding whether to stop early).
    • something to consider supporting: early stopping after a given number of timesteps, rather than iterations
  2. a coroutine, decode(), which maintains statistics over decodings—prints progress information every so often and accuracy where applicable. Receives each instance and then yields the predictions for that instance. Called by learn() during training and decode_dataset() during test/prediction.
  3. a coroutine, _viterbi(), which sets up DP tables, accepts instances one by one, and yields the predictions. Called by decode().
    • the point of using a coroutine rather than just a function is to maintain the DP tables locally across instances
  4. a function, decode_dataset(), which makes a pass through the eval (test) or predict data given the current model and optionally prints the predictions. Called by main().

These changes should allow the non-training SupersenseDataSet to be constructed with keep_in_memory=False.

I think implementing steps 2 and 3 as coroutines (that receive single instances using send()), rather than typical generators (that iterate over all instances), will simplify the creation of a server mode in which instances are processed on demand.
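
A minimal sketch of the coroutine shape for items 2–4 (only the names decode(), _viterbi(), and decode_dataset() come from the plan above; all bodies are placeholders, and the priming decorator is the standard one from the Dabeaz tutorial linked below):

def coroutine(func):
    """Prime a coroutine by advancing it to its first yield, so it is
    immediately ready to receive values via .send()."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def _viterbi(weights):
    # Real code would allocate the DP tables here, once, and reuse
    # them across instances; that is the point of the coroutine.
    preds = None
    while True:
        sent = (yield preds)       # suspend until .send(sent)
        preds = ['O'] * len(sent)  # placeholder for the actual Viterbi pass

@coroutine
def decode(weights, print_every=1000):
    tagger = _viterbi(weights)
    n = 0
    preds = None
    while True:
        sent = (yield preds)
        preds = tagger.send(sent)  # one instance in, one prediction out
        n += 1
        if n % print_every == 0:
            print('decoded %d sentences' % n)  # progress; accuracy would go here too

def decode_dataset(weights, data, print_predictions=True):
    """One pass over the eval/predict data with the current model."""
    dec = decode(weights)
    for sent in data:  # works with a streaming dataset: nothing is retained
        preds = dec.send(sent)
        if print_predictions:
            for tok, pred in zip(sent, preds):
                print('%s\t%s' % (tok[0], pred))  # assumes token text is field 0
            print('')  # blank line between sentences

Because decode() holds its state between send() calls, a server mode reduces to calling dec.send(sent) on each incoming sentence; and since decode_dataset() touches one sentence at a time, it pairs naturally with keep_in_memory=False.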

nschneid commented 10 years ago

Nice tutorial on coroutines: http://www.dabeaz.com/coroutines/Coroutines.pdf