kyunghyuncho / NMT


Large vocabulary #4

Open sebastien-j opened 9 years ago

sebastien-j commented 9 years ago

@bartvm, @rizar,

I have a few questions about Blocks in order to be able to implement the large-vocabulary models.

How can I access and modify the parameters of the training algorithms (eg the running averages of Adadelta)? Is it possible to save these when checkpointing?

How do I do some operations before every batch? Mainloop extensions?

bartvm commented 9 years ago

If you want to change things, extensions are almost always the answer. They have access to the main loop and, through it, to the algorithm, data stream, etc. They respond to callbacks, one of which is the before_batch callback. You can inherit from TrainingExtension and define a before_batch method. If you then pass the extension to the main loop, it will be called right before each batch is processed.
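A minimal sketch of the callback pattern described above, stubbed in plain Python so it runs without Blocks. The class and method names mirror the Blocks API, but the `MainLoop` stub and the `VocabSwitcher` extension are hypothetical simplifications:

```python
class TrainingExtension:
    """Base class: every callback defaults to a no-op."""
    def before_batch(self, batch):
        pass

class VocabSwitcher(TrainingExtension):
    """Hypothetical extension; here it just records which batches it saw.
    A real one could, e.g., swap vocabulary mappings at this point."""
    def __init__(self):
        self.seen = []
    def before_batch(self, batch):
        self.seen.append(batch)

class MainLoop:
    """Stub main loop that dispatches the before_batch callback."""
    def __init__(self, extensions):
        self.extensions = extensions
    def run(self, batches):
        for batch in batches:
            for ext in self.extensions:
                ext.before_batch(batch)   # fires right before each batch
            # ... process the batch ...

switcher = VocabSwitcher()
MainLoop([switcher]).run([0, 1, 2])
print(switcher.seen)  # → [0, 1, 2]
```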

Everything is saved when checkpointing. Resuming a model restores the main loop to exactly the state it was in before saving. That includes the values of all Theano shared variables.

There is no unified way of accessing particular variables of an algorithm though. If you need access to the Adadelta parameters for any particular reason, you'll need to change the algorithm so that it makes these shared variables accessible through an attribute.

sebastien-j commented 9 years ago

Ok, thanks.

Comments in stream.py mention that we should use caching and multiprocessing to accelerate the data stream. Would the batch order still be deterministic in that case?

bartvm commented 9 years ago

Yeah, I meant multiprocessing in the sense of separating the data reading in one process and the training in another, so nothing happens to the determinism of the data.

bartvm commented 9 years ago

By the way, the multiprocessing doesn't support checkpointing fully yet, so you might want to hold off for a bit if that matters.

rizar commented 9 years ago

> There is no unified way of accessing particular variables of an algorithm though. If you need access to the Adadelta parameters for any particular reason, you'll need to change the algorithm so that it makes these shared variables accessible through an attribute.

Or you can assign them meaningful names and/or roles and retrieve them using VariableFilter from ComputationGraph(algorithm.steps). I think that will be the preferred way in Blocks.
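The tag-and-filter idea suggested here can be sketched without Theano or Blocks. `FakeShared` and `step_graph` below are hypothetical stand-ins for Theano shared variables and for the variables of `ComputationGraph(algorithm.steps)`; the list comprehension plays the role of VariableFilter, selecting by name instead of reaching into the algorithm's private attributes:

```python
class FakeShared:
    """Stand-in for a Theano shared variable with a meaningful name."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

# Suppose the step rule created these (cf. AdaDelta's running averages).
step_graph = [
    FakeShared("W", 0.5),
    FakeShared("mean_square_step_W", 0.01),
    FakeShared("mean_square_delta_W", 0.02),
]

# A VariableFilter-like selection by name:
adadelta_state = [v for v in step_graph if "mean_square" in v.name]
print([v.name for v in adadelta_state])
# → ['mean_square_step_W', 'mean_square_delta_W']
```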

sebastien-j commented 9 years ago

How does Blocks save (or deal with) files or file handles?

In the original large-vocabulary implementation, before training I went over the data once, storing batch ids and word mappings in shelve files.

Other implementations could be a bit faster, for example by starting training before visiting the entire dataset, but doing so is a bit messier. In any case, that preprocessing step was fairly short for a single epoch (about 1 hour for English->French if I remember correctly), but could get costly if the dataset was reshuffled before every epoch.

bartvm commented 9 years ago

Pickling of files is handled by picklable_itertools. It saves the position it was reading from, and when unpickled it re-opens the file and seeks back to that position: https://github.com/dwf/picklable_itertools/blob/master/picklable_itertools/iter_dispatch.py#L71-L78
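The mechanism linked above can be demonstrated in a few lines: record the byte offset with tell() when pickling, then reopen and seek() back on unpickling. `PicklableFile` is a hypothetical simplification, not picklable_itertools' actual class:

```python
import os
import pickle
import tempfile

class PicklableFile:
    """Toy file wrapper that survives pickling by saving its offset."""
    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")
    def readline(self):
        return self.handle.readline()
    def __getstate__(self):
        # Save the path and current offset instead of the OS handle.
        return {"path": self.path, "pos": self.handle.tell()}
    def __setstate__(self, state):
        self.path = state["path"]
        self.handle = open(self.path, "rb")
        self.handle.seek(state["pos"])   # resume where we left off

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"line1\nline2\n")

reader = PicklableFile(path)
reader.readline()                        # consume "line1\n"
restored = pickle.loads(pickle.dumps(reader))
out = restored.readline()
print(out)  # → b'line2\n'
```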

What preprocessing do you need to do exactly? An hour sounds very long.

sebastien-j commented 9 years ago

I was going through the full training data, saving when to change the vocabularies as well as the token id mappings between the full vocabulary and the current one. When training goes on for more than a week, 1 hour is negligible.

bartvm commented 9 years ago

Okay, I guess it doesn't matter much then. I might still suggest dropping shelve and using cPickle directly instead; it might be quite a bit faster, assuming that you are loading your entire dictionary into memory. Also be sure to pass the protocol=2 parameter.
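A minimal sketch of that suggestion, using Python 3's pickle (where cPickle's C implementation is the default) with protocol 2, which is the compact binary format being recommended. The mapping contents here are made up:

```python
import os
import pickle
import tempfile

# Toy full-vocabulary -> shortlist id mapping (values are made up).
mapping = {i: i % 30000 for i in range(1000)}

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    pickle.dump(mapping, f, protocol=2)   # binary protocol, as suggested

with open(path, "rb") as f:
    loaded = pickle.load(f)   # the whole dictionary comes back into memory

print(loaded[999])  # → 999
```

Unlike shelve, this loads everything at once, which is the memory/speed trade-off discussed below.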

Anyway, the vocabulary mapping is probably best done using a Transformer in Fuel, which should be straightforward. Then you just need an extension with a before_epoch callback that shuffles the file, creates the dictionary mappings, and passes these to the transformer (data stream), plus an after_batch callback that transfers the word embeddings.
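The transformer idea can be sketched without Fuel as a generator that wraps a stream of batches and remaps token ids from the full vocabulary to the current shortlist. `remap_stream`, the mapping, and the unknown-word id are all hypothetical:

```python
def remap_stream(batches, mapping, unk_id=1):
    """Yield batches with each token id translated through `mapping`;
    tokens outside the current shortlist fall back to `unk_id`."""
    for batch in batches:
        yield [mapping.get(tok, unk_id) for tok in batch]

full_to_short = {100: 0, 250: 2, 999: 3}   # made-up shortlist mapping
batches = [[100, 250, 7], [999]]
out = list(remap_stream(batches, full_to_short))
print(out)  # → [[0, 2, 1], [3]]
```

A real Fuel Transformer would do the same remapping in its transform method, with the mapping swapped out by the extension at epoch boundaries.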

sebastien-j commented 9 years ago

I am not sure about using cPickle instead of shelve. On English-French (single epoch), the total size of the mappings was about 3 GB. This is not very large, but in some cases we may want that memory for something else.

To keep the shelve from slowing down training much, I had a small in-memory dictionary/set containing the ids of the batches at which there was a change, and only consulted the shelve when needed.
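That trick can be sketched as follows: a cheap membership test on every batch, with the expensive on-disk lookup only at change points. `disk_lookup` and the data below are hypothetical stand-ins for the shelve access:

```python
change_points = {0, 500}                  # made-up change batches
store = {0: "mapping_A", 500: "mapping_B"}

def disk_lookup(batch_id):
    """Stand-in for an on-disk shelve access."""
    return store[batch_id]

lookups = []
current = None
for batch_id in range(1000):
    if batch_id in change_points:         # cheap set test every batch
        current = disk_lookup(batch_id)   # slow lookup only at changes
        lookups.append(batch_id)

print(lookups)  # → [0, 500]
```

Out of 1000 batches, the store is hit only twice; everything else costs one set lookup.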

bartvm commented 9 years ago

Okay, I don't think I really understand. Don't you have a separate mapping for each section? Then you could just store each mapping in a separate pickle file and unpickle it at the beginning of a section.

sebastien-j commented 9 years ago

That would be possible, but there would be a lot of small files. Is that really preferable to one or a few large ones?

One disadvantage of shelve is its dependency on dbm. As far as I know, shelves created with different implementations of dbm are not directly compatible, although converting files is possible.

sebastien-j commented 9 years ago

@rizar, is it normal that the only variables in ComputationGraph(algorithm.steps) are the model parameters?

rizar commented 9 years ago

No, it's not. Do you mean that if you do

>>> cg = ComputationGraph(algorithm.steps)
>>> cg.variables 

then you simply get reordered cg.params?

sebastien-j commented 9 years ago

Yes (with cg.parameters).

rizar commented 9 years ago

Can you show me the code that creates the algorithm?

sebastien-j commented 9 years ago
# Set up training algorithm
algorithm = GradientDescent(
    cost=cost, params=cg.parameters,
    step_rule=CompositeRule([StepClipping(10), AdaDelta()])
)
rizar commented 9 years ago

I guess what you need is actually ComputationGraph(algorithm.steps.values()).