neulab / xnmt

eXtensible Neural Machine Translation

Refactor experiments / resume crashed trainings #367

Open msperber opened 6 years ago

msperber commented 6 years ago

It would be nice to do the following:

This would allow the following:

A config file with a single experiment would probably look like this:

!SimpleExperiment
  preproc: ..
    state: ..
  train: ..
    state: .. # the state is usually not given in the config file,
              # but included when the model is written out
  evaluate: ..
  state: ..

Or for a series of experiments (which is more similar to the current config files which always contain a series of experiments):

!ExperimentSeries
  experiments:
  - !SimpleExperiment
      state: ..
  - !SimpleExperiment ..

I believe this would be relatively easy to do. The main thing I'm not sure how best to handle is that we would no longer have experiment names, so {EXP} may no longer work.
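
To make the role of the `state` entries concrete, here is a minimal Python sketch of how a resumed experiment might skip stages that had already finished. It assumes that each stage's saved state records whether the stage completed; the `Stage` class, the `"done"` key, and the `resume` method are hypothetical illustrations, not xnmt's actual classes or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """Hypothetical stage: a callable plus the state written out with the model."""
    run: Callable[[], None]
    state: Dict[str, bool] = field(default_factory=dict)  # assumed form: {"done": True}


@dataclass
class SimpleExperiment:
    """Hypothetical stand-in mirroring the preproc / train / evaluate layout."""
    preproc: Stage
    train: Stage
    evaluate: Stage

    def resume(self) -> List[str]:
        """Run every stage whose saved state does not mark it as done."""
        executed = []
        for name in ("preproc", "train", "evaluate"):
            stage = getattr(self, name)
            if stage.state.get("done"):
                continue  # finished before the crash, so skip it
            stage.run()
            stage.state["done"] = True
            executed.append(name)
        return executed


# Simulate a run that crashed after preprocessing had completed.
log: List[str] = []
exp = SimpleExperiment(
    preproc=Stage(run=lambda: log.append("preproc"), state={"done": True}),
    train=Stage(run=lambda: log.append("train")),
    evaluate=Stage(run=lambda: log.append("evaluate")),
)
executed = exp.resume()  # only train and evaluate are run
```

A config file without any `state` entries would behave like a fresh run here, since every stage starts with an empty state.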

neubig commented 6 years ago

I agree that this would be nice, but I actually don't see why we couldn't have experiment names. I think there are two options for the syntax:

!SimpleExperiment
  name: my_name
  ...

or

my_name: !SimpleExperiment
  ...

If we choose the latter, the serializer could check that the top level of the dictionary has exactly one element and that this element is an experiment.
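
A rough Python sketch of that check, operating on an already-parsed top-level dictionary. The `SimpleExperiment` class below is a hypothetical stand-in, and `extract_named_experiment` and `expand_exp` are illustrative names rather than functions from xnmt's serializer. The `{EXP}` handling is shown as a plain string replacement, which is an assumption about how the placeholder would be filled.

```python
class SimpleExperiment:
    """Hypothetical stand-in for the experiment class; not xnmt's real one."""

    def __init__(self, **kwargs):
        self.name = kwargs.get("name")
        self.args = kwargs


def extract_named_experiment(top_level):
    """Return (name, experiment) for a `my_name: !SimpleExperiment` document."""
    if not isinstance(top_level, dict) or len(top_level) != 1:
        raise ValueError("expected exactly one top-level entry")
    (name, exp), = top_level.items()
    if not isinstance(exp, SimpleExperiment):
        raise ValueError(f"top-level entry {name!r} is not an experiment")
    exp.name = name  # the dictionary key doubles as the experiment name
    return name, exp


def expand_exp(value, name):
    """Fill the {EXP} placeholder with the experiment name (assumed behaviour)."""
    return value.replace("{EXP}", name)


name, exp = extract_named_experiment({"my_name": SimpleExperiment()})
model_path = expand_exp("models/{EXP}.mod", name)  # "models/my_name.mod"
```

With the first syntax (`name: my_name` inside the experiment), the same `expand_exp` step would work unchanged; only the place where the name is read from differs.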

neubig commented 5 years ago

Is this fixed now? I'm not sure...

msperber commented 5 years ago

No, I think nothing has been done along these lines yet.

msperber commented 5 years ago

Making config files and saved experiments compatible has been implemented by #491.

Some thoughts on what would need to be done to support resuming crashed experiments:

philip30 commented 5 years ago

I think having one model per checkpoint is very reasonable; TensorFlow, for example, does the same thing. If disk space is the concern, maybe we can add a flag to turn this behaviour off, with the consequence that training can't be resumed.
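
The trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file naming scheme, the JSON format, and the `keep_all` flag are hypothetical and not taken from xnmt or TensorFlow. When `keep_all` is off, the sketch overwrites a single file that the resume logic deliberately ignores, mirroring the consequence described above.

```python
import json
import os
import tempfile


def save_checkpoint(out_dir, step, params, keep_all=True):
    """Write one file per checkpoint, or overwrite a single file if keep_all is off."""
    fname = f"model.ckpt-{step}.json" if keep_all else "model.json"
    path = os.path.join(out_dir, fname)
    with open(path, "w") as f:
        json.dump({"step": step, "params": params}, f)
    return path


def latest_checkpoint(out_dir):
    """Return the saved state with the highest step, or None if nothing is resumable."""
    best = None
    for fname in os.listdir(out_dir):
        if fname.startswith("model.ckpt-") and fname.endswith(".json"):
            with open(os.path.join(out_dir, fname)) as f:
                state = json.load(f)
            if best is None or state["step"] > best["step"]:
                best = state
    return best


with tempfile.TemporaryDirectory() as d:
    save_checkpoint(d, 1, [0.1])
    save_checkpoint(d, 2, [0.2])
    resumed = latest_checkpoint(d)  # state of the checkpoint written at step 2
```

Per-checkpoint files cost disk space proportional to the number of checkpoints, so a real implementation would probably also want a limit on how many are kept.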