Closed by nschneid 10 years ago
Also, avoid storing the entire test dataset in memory (which seems to be happening now). Cf. `args.disk`, which applies to the training data (not yet implemented).
This will require some refactoring of `decode()`, which currently takes the dataset object as a parameter and yields the weights on each iteration.
Options:
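- Make the behavior of `decode()` controlled by arguments?
- Have `decode()` yield each decoded instance during prediction?
- Have `decode()` simply `yield` once it finishes each instance, and from main do:

```python
from itertools import izip  # Python 2; the built-in zip() is the Python 3 equivalent

decoder = t.decode(testData, ...)
for _, (sent, o0Feats) in izip(decoder, testData):
    # print predictions
    # delete instance
    pass
```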
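The "yield once per instance" option could look roughly like the sketch below. It is a single-pass variant in which the decoder yields each sentence together with its predictions, so the caller never has to walk the dataset a second time. The names `read_instances`, `decode`, and `predict_lines`, the whitespace-separated input format, and the toy capitalization-based tagger are all assumptions made for illustration; none of them come from the actual code base.

```python
def read_instances(lines):
    """Lazily turn raw lines into token lists, skipping blank lines."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


def decode(instances):
    """Stand-in decoder: yields (sentence, tags) as soon as each instance is done."""
    for sent in instances:
        # Placeholder for real Viterbi decoding: tag capitalized tokens "N", others "O".
        tags = ["N" if tok[0].isupper() else "O" for tok in sent]
        yield sent, tags


def predict_lines(lines):
    """Format predictions one instance at a time; nothing is accumulated here."""
    for sent, tags in decode(read_instances(lines)):
        yield " ".join("{}/{}".format(tok, tag) for tok, tag in zip(sent, tags))


if __name__ == "__main__":
    for out in predict_lines(["Kim sleeps", "", "dogs bark"]):
        print(out)
```

Because every stage is a generator, only one instance is alive at a time, provided the input itself is an iterator (for example an open file) rather than a list.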
Alternative to the above:
Split `decode()` and `_viterbi()` into:

1. `learn()`, used only during training: iterates over the dataset multiple times, stores predictions in that dataset, and calls `_perceptronUpdate()`. Yields the model weights after each iteration and averages them after the last iteration. Called by `train()`.
2. `decode()`, which maintains statistics over decodings: it prints progress information every so often, and accuracy where applicable. Receives each instance and then yields the predictions for that instance. Called by `learn()` during training and by `decode_dataset()` during test/prediction.
3. `_viterbi()`, which sets up DP tables, accepts instances one by one, and yields the predictions. Called by `decode()`.
4. `decode_dataset()`, which makes a pass through the eval (test) or predict data given the current model and optionally prints the predictions. Called by `main()`.

These changes should allow the non-training `SupersenseDataSet` to be constructed with `keep_in_memory=False`.
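To illustrate what a `keep_in_memory=False` dataset might do, here is a minimal sketch. `LazyDataSet` is a hypothetical stand-in, not the project's `SupersenseDataSet`; the tab-separated file format and everything except the `keep_in_memory` parameter name are assumptions. The idea is that each iteration re-reads the file instead of caching parsed instances.

```python
import os
import tempfile


class LazyDataSet:
    """Hypothetical dataset wrapper; re-reads from disk unless keep_in_memory is set."""

    def __init__(self, path, keep_in_memory=True):
        self.path = path
        self.keep_in_memory = keep_in_memory
        self._cache = None

    def _read(self):
        # Generator: parses one tab-separated line at a time.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield tuple(line.split("\t"))

    def __iter__(self):
        if not self.keep_in_memory:
            return self._read()  # fresh pass over the file, nothing retained
        if self._cache is None:
            self._cache = list(self._read())
        return iter(self._cache)


def demo():
    """Write a tiny temporary file and iterate over it twice without caching."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("dog\tNN\nbarks\tVBZ\n")
    try:
        data = LazyDataSet(path, keep_in_memory=False)
        first_pass = list(data)
        second_pass = list(data)
        return first_pass, second_pass, data._cache
    finally:
        os.remove(path)
```

The trade-off is extra I/O per pass, which matters for training (many passes) but not for a single test or prediction pass.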
I think implementing steps 2 and 3 as coroutines (that receive single instances using `send()`), rather than typical generators (that iterate over all instances), will simplify the creation of a server mode in which instances are processed on demand.
Nice tutorial on coroutines: http://www.dabeaz.com/coroutines/Coroutines.pdf
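A rough sketch of the coroutine idea, assuming a toy per-token tagger in place of real Viterbi decoding. The names `viterbi_coroutine`, `decode_coroutine`, `toy_tagger`, and the `stats` dictionary are made up for illustration. Each coroutine must be primed with `next()` before the first `send()`.

```python
def viterbi_coroutine(tagger):
    """Receives one sentence per send() and yields its predictions."""
    predictions = None
    while True:
        sent = yield predictions
        # Placeholder for the DP search: tag each token independently.
        predictions = [tagger(tok) for tok in sent]


def decode_coroutine(tagger, stats):
    """Wraps the inner coroutine and keeps running statistics in `stats`."""
    inner = viterbi_coroutine(tagger)
    next(inner)  # prime: advance to the first yield
    predictions = None
    while True:
        sent = yield predictions
        predictions = inner.send(sent)
        stats["instances"] += 1
        stats["tokens"] += len(sent)


def toy_tagger(token):
    return "N" if token[0].isupper() else "O"


stats = {"instances": 0, "tokens": 0}
decoder = decode_coroutine(toy_tagger, stats)
next(decoder)  # prime the outer coroutine as well

first = decoder.send(["Kim", "sleeps"])
second = decoder.send(["dogs"])
decoder.close()
```

A server loop would simply call `decoder.send(instance)` whenever a request arrives, which is why this shape fits on-demand processing better than a generator that pulls from a fixed dataset.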
Create a `--predict` mode that, unlike `--test-predict`, does not require a (meaningless) third input column and does not output meaningless "accuracy" information.
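One way the two modes could be wired up is sketched below with `argparse`. Only the flag names `--predict` and `--test-predict` come from the text above; that each flag takes a file path, that the two are mutually exclusive, that input is tab-separated, and that the third column holds a gold label are all assumptions, as are the helper names `build_parser` and `parse_line`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical tagger CLI sketch")
    mode = parser.add_mutually_exclusive_group()
    # Assumption: each flag takes the path of the file to tag.
    mode.add_argument("--predict", metavar="FILE",
                      help="tag FILE; two input columns, no accuracy report")
    mode.add_argument("--test-predict", metavar="FILE",
                      help="tag FILE and report accuracy; needs a third (gold) column")
    return parser


def parse_line(line, has_gold):
    """Split one tab-separated line; the gold column is required only when has_gold."""
    cols = line.rstrip("\n").split("\t")
    expected = 3 if has_gold else 2
    if len(cols) != expected:
        raise ValueError("expected {} columns, got {}".format(expected, len(cols)))
    gold = cols[2] if has_gold else None
    return cols[0], cols[1], gold


args = build_parser().parse_args(["--predict", "input.tsv"])
has_gold = args.test_predict is not None  # accuracy is only meaningful with gold labels
```

With this split, accuracy reporting can be guarded by `has_gold`, so `--predict` never prints it.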