Open sebastien-j opened 9 years ago
If you want to change things, extensions are almost always the answer. They have access to the main loop, and through it, to the algorithm, data stream, etc. They respond to callbacks, one of which is the `before_batch` callback. You can inherit from `TrainingExtension` and define the `before_batch` method. If you then pass the extension to the main loop, it will be called right before each batch is processed.
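For example, a minimal sketch (the class name and what it prints are made up, just to show the shape):

```python
from blocks.extensions import TrainingExtension

class PrintBatchShapes(TrainingExtension):
    """Toy extension: runs right before each batch is processed."""
    def before_batch(self, batch):
        # `batch` is the dict of sources produced by the data stream;
        # `self.main_loop` gives access to the algorithm, log, data stream, etc.
        print({source: data.shape for source, data in batch.items()})

# Passed to the main loop alongside the other extensions, e.g.
# main_loop = MainLoop(..., extensions=[PrintBatchShapes(), ...])
```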
Everything is saved when checkpointing. Resuming a model restores the main loop to exactly the state it was in before saving. That includes the values of all Theano shared variables.
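For reference, checkpointing itself is just another extension; something like the following (the filename is arbitrary, and the exact arguments depend on your Blocks version):

```python
from blocks.extensions.saveload import Checkpoint

# Serializes the whole main loop (including all Theano shared variables,
# so the algorithm's internal state too) after every epoch.
checkpoint = Checkpoint("checkpoint.tar", after_epoch=True)
# main_loop = MainLoop(..., extensions=[checkpoint, ...])
```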
There is no unified way of accessing particular variables of an algorithm though. If you need access to the AdaDelta parameters for any particular reason, you'll need to change the algorithm so that it makes these shared variables accessible through an attribute.
Ok, thanks.
Comments in `stream.py` mention that we should use caching and multiprocessing to accelerate the data stream. Would batch order still be deterministic in this case?
Yeah, I meant multiprocessing in the sense of separating the data reading in one process and the training in another, so nothing happens to the determinism of the data.
By the way, the multiprocessing doesn't support checkpointing fully yet, so you might want to hold off for a bit if that matters.
> There is no unified way of accessing particular variables of an algorithm though. If you need access to the AdaDelta parameters for any particular reason, you'll need to change the algorithm so that it makes these shared variables accessible through an attribute.
Or you can assign them meaningful names and/or roles and retrieve them using `VariableFilter` from `ComputationGraph(algorithm.steps)`. I think that will be the preferred way in Blocks.
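Something along these lines, assuming the step rule's shared variables were given a recognizable name when they were created (the name used here is hypothetical, and the exact `VariableFilter` keyword may differ between Blocks versions):

```python
from blocks.graph import ComputationGraph
from blocks.filter import VariableFilter

# algorithm.steps maps each parameter to its update expression,
# so the graph has to be built from the values.
step_graph = ComputationGraph(list(algorithm.steps.values()))

# Hypothetical name: whatever the step rule called its running-average shared variables.
running_averages = VariableFilter(name="mean_square_step_tm1")(step_graph.variables)
```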
How does Blocks save (or deal with) files or file handles?
In the original large-vocabulary implementation, before training, I was going over the data once, storing batch ids and word mappings in `shelve` files.
Other implementations could be a bit faster, for example by starting training before visiting the entire dataset, but doing so is a bit messier. In any case, that preprocessing step was fairly short for a single epoch (about 1 hour for English->French if I remember correctly), but could get costly if the dataset was reshuffled before every epoch.
Pickling of files is handled by `picklable_itertools`. It saves the position it was reading from, and when unpickling it re-opens the file and seeks back to that position: https://github.com/dwf/picklable_itertools/blob/master/picklable_itertools/iter_dispatch.py#L71-L78
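A quick illustration of what that buys you (file name made up):

```python
import pickle
from picklable_itertools import iter_

it = iter_(open("corpus.txt"))
next(it)                   # read the first line
state = pickle.dumps(it)   # pickling records the current file position

it2 = pickle.loads(state)  # unpickling re-opens the file and seeks back,
next(it2)                  # so this yields the second line
```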
What preprocessing do you need to do exactly? An hour sounds very long.
I was going through the full training data, saving when to change the vocabularies as well as the token id mappings between the full vocabulary and the current one. When training goes on for more than a week, 1 hour is negligible.
Okay, I guess it doesn't matter much then. I might still suggest dropping `shelve` and using `cPickle` directly instead; it could be quite a bit faster, assuming that you are loading your entire dictionary into memory. Also be sure to pass the `protocol=2` parameter.
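Roughly like this (the file name is made up, and on Python 3 it would just be `pickle`):

```python
import cPickle

mappings = {"the": 0, "cat": 1}  # placeholder for the real id mappings

# Dump the whole mapping dict once in the binary protocol (2), which is
# much faster and more compact than the default text protocol.
with open("mappings.pkl", "wb") as f:
    cPickle.dump(mappings, f, 2)

# Later, load the whole thing back into memory.
with open("mappings.pkl", "rb") as f:
    mappings = cPickle.load(f)
```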
Anyway, the vocabulary mapping is probably best done using a `Transformer` in Fuel; that should be straightforward. Then you just need an extension that uses the `before_epoch` callback to shuffle the file, create the dictionary mappings, and pass these to the transformer (data stream), and an `after_batch` callback that transfers the word embeddings.
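A rough sketch of that wiring, using Fuel's `Mapping` transformer as the data-stream side (all class and variable names here are made up, and the embedding transfer is left as a comment):

```python
import numpy
from fuel.transformers import Mapping
from blocks.extensions import TrainingExtension

class VocabularyMapper(object):
    """Stateful callable applied to every batch by the Mapping transformer."""
    def __init__(self):
        # full-vocabulary id -> current short-list id, rebuilt before each epoch
        self.mapping = {}

    def __call__(self, data):
        # `data` is the tuple of sources for one batch; remap the token ids
        remap = numpy.vectorize(lambda i: self.mapping.get(i, i))
        return tuple(remap(source) for source in data)

class VocabularyUpdater(TrainingExtension):
    """Reshuffles/remaps before each epoch, syncs embeddings after each batch."""
    def __init__(self, mapper, **kwargs):
        super(VocabularyUpdater, self).__init__(**kwargs)
        self.mapper = mapper

    def before_epoch(self):
        # reshuffle the data file and recompute self.mapper.mapping here
        pass

    def after_batch(self, batch):
        # copy the short-list rows of the full embedding matrix into the
        # model's embedding shared variable (get_value/set_value) here
        pass

# mapper = VocabularyMapper()
# stream = Mapping(shuffled_stream, mapper)
# main_loop = MainLoop(..., data_stream=stream,
#                      extensions=[VocabularyUpdater(mapper), ...])
```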
I am not sure about using `cPickle` instead of `shelve`. On English-French (single epoch), the total size of the mappings was about 3 GB. This is not very large, but in some cases we may want that memory for something else.
To keep the `shelve` from slowing down training much, I kept a small in-memory dictionary/set with the ids of the batches at which there was a change, and only hit the `shelve` when needed.
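That pattern is basically (file name and key format hypothetical):

```python
import shelve

change_points = set()                      # batch ids where the vocabulary changes; tiny, kept in memory
mappings = shelve.open("mappings.shelve")  # the big id mappings stay on disk

def mapping_for(batch_id, current_mapping):
    # Only touch the (slow) shelve at the batches that start a new section.
    if batch_id in change_points:
        return mappings[str(batch_id)]     # shelve keys must be strings
    return current_mapping
```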
Okay, I don't think I really understand. Don't you have a separate mapping for each section? So then you would just store each mapping into a separate pickle file, and unpickle that at the beginning of a section.
That would be possible, but there would be a lot of small files. Is that really preferable to one or a few large ones?
One disadvantage of `shelve` is its dependency on `dbm`. As far as I know, shelves created with different implementations of `dbm` are not directly compatible, although converting files is possible.
@rizar, is it normal that the only variables in `ComputationGraph(algorithm.steps)` are the model parameters?
No, it's not. Do you mean that if you do

```python
>>> cg = ComputationGraph(algorithm.steps)
>>> cg.variables
```

then you simply get a reordered `cg.params`?
Yes (with `cg.parameters`).
Can you show me the code that creates the algorithm?
```python
# Set up training algorithm
algorithm = GradientDescent(
    cost=cost, params=cg.parameters,
    step_rule=CompositeRule([StepClipping(10), AdaDelta()])
)
```
I guess what you need is actually `ComputationGraph(algorithm.steps.values())`.
@bartvm, @rizar,
I have a few questions about Blocks in order to be able to implement the large-vocabulary models.
How can I access and modify the parameters of the training algorithm (e.g., the running averages of AdaDelta)? Is it possible to save these when checkpointing?
How do I do some operations before every batch? Main loop extensions?