Yes, this config trains from scratch. There is also a small conv NN before the first encoder LSTM layer. Despite that, and SpecAugment, I think they are quite similar. Maybe make a diff to see the exact differences. The training time in this config is still the same number of epochs (12.5 full epochs), and the training speed should be similar (SpecAugment is very fast; the conv NN adds some small overhead here). Note though that with SpecAugment, you can in principle train much longer, until convergence, i.e. make the learning rate scheduling more conservative. By training e.g. twice as long (25 full epochs), or even longer, you can still gain a lot of improvement. This is also what the original SpecAugment paper reports: they train for 600 full epochs! (One month of training time on a TPU cluster...) But even when training only as long as before (12.5 epochs), you should still see a lot of improvement with SpecAugment.
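As a rough illustration of what "train longer with a more conservative learning-rate schedule" could mean in the config (the values below are placeholders, not taken from any of the linked configs; the setting names are standard RETURNN options):

# Sketch only: assumes an epoch split of 20, so 500 sub-epochs = 25 full epochs.
num_epochs = 500                      # e.g. twice as long as 250 sub-epochs (12.5 full epochs)
learning_rate_control = "newbob_multi_epoch"
newbob_multi_num_epochs = 20          # judge the LR decay per full epoch
newbob_multi_update_interval = 1
newbob_learning_rate_decay = 0.9      # decay the LR more slowly, i.e. more conservative
min_learning_rate = 1e-5              # placeholder floor, not from the config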
Note that this config is not described in the corresponding paper, as I only later added it, for reference. But it is described (briefly) in our 2019 ASRU paper. (More configs here.)
Thanks! This sounds great, I will see how my training goes. Will post any updates here.
train-scores.data.txt I just started a SpecAugment training with base2.conv2l.specaug.curric3, and I am seeing something curious here: the dev_error_ctc does not seem to be going down as fast as in my other trainings (without SpecAugment). It is still at 0.92 after 26 "epochs". Am I reading it wrong, or is this kind of expected? Thanks!
Please disregard my previous comment, I think my data might be a mess.
Hello!
In order to reproduce your results, I tried training only with LibriSpeech data. I used RETURNN commit bea4cb578a8c93c7d59a4d7e4898dc3eeaa042d0 (https://github.com/rwth-i6/returnn/commit/bea4cb578a8c93c7d59a4d7e4898dc3eeaa042d0), returnn-experiments commit 98cea81963626f7136f6124635cc6e1e9022e862, and base2.conv2l.specaug.curric3.config.
The train-scores.data does not show a lot of improvement, and the training breaks at 53 "epochs" with the error below:
Model seems broken, got inf or nan score.
Accumulated scores: NumbersDict({'error:decision': 0.0, 'cost:ctc': inf, 'cost:output/output_prob': 304599.4338989258, 'error:ctc': 72092.0, 'error:output/output_prob': 57407.0, 'loss': inf})
Exception Exception('Inf/nan score in step 282.',) in step 282.
Are you familiar with this issue? Thanks!
Can you check #34? There was a similar discussion.
Thanks Albert! I am using a Tesla V100 SXM2 32 GB for my experiments, with TF 1.12. I ran experiments with a higher number of steps (15 instead of 10) in the learning rate warmup, and also reduced the minimum learning rate to 0.0001. I will let it run for a few epochs and see.
Hello, after changing the number of warmup steps, I do not get NaNs anymore, but the training still seems to be off. Does this train-scores.txt make sense to you? Is it consistent with your experiments? The error rate is not really coming down very fast.
It looks like it never really converges properly (from those scores). E.g. your last epoch:
88: EpochData(learningRate=0.00020334926626632013, error={
'dev_error_ctc': 0.9141972329849649,
'dev_error_decision': 0.0,
'dev_error_output/output_prob': 0.7254169391549435,
'dev_score_ctc': 6.506660251327039,
'dev_score_output/output_prob': 3.7826016050155156,
'train_error_ctc': 0.9481592299677715,
'train_error_decision': 0.0,
'train_error_output/output_prob': 0.7328658090259365,
'train_score_ctc': 6.724799936911294,
'train_score_output/output_prob': 3.7869126781015656,
}),
The CTC score/error should be much lower; the CTC error should definitely be below 50%. This should also happen fairly early in the training, maybe after 10 epochs (depending on the epoch split, e.g. such that one epoch corresponds to ~50h of train data).
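To connect the "epochs" (sub-epochs) in these logs with full epochs, a quick back-of-the-envelope calculation (assuming ~960h of LibriSpeech train data and an epoch split of 20, which is an assumption about this setup):

# Rough arithmetic; epoch_split = 20 is an assumption, not read from the config here.
total_hours = 960                  # train-clean-100 + train-clean-360 + train-other-500
epoch_split = 20
print(total_hours / epoch_split)   # 48.0 -> roughly the ~50h per (sub-)epoch mentioned above
print(int(12.5 * epoch_split))     # 250  -> 12.5 full epochs correspond to 250 sub-epochs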
Is this now for the original LibriSpeech data, or your own data? I would double check that your data is correct. Use e.g. dump-dataset. Also compare that to a run with the original LibriSpeech data.
Play around with pretraining more (see custom_construction_algo). Let it start with 2 BLSTM encoder layers initially (StartNumLayers = 2), and increase the number of repetitions for this first pretrain step (there is sth like idx = max(idx - 3, 0) # repeat first; increase that, i.e. maybe idx = max(idx - 6, 0)). You also have pretrain = {"repetitions": 5, ...}, i.e. that means that for the first 6 * 5 = 30 epochs, it will use the same network (2 encoder layers, 512 dims). This small encoder should converge fast, so 30 epochs should be more than enough. If the scores don't go down during these first 30 epochs, sth is wrong.
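A minimal sketch of just the pretraining-schedule part being suggested here (not the full custom_construction_algo from the config, which also grows the layer dimensions, adjusts pooling, etc.):

StartNumLayers = 2   # start pretraining with 2 BLSTM encoder layers

def custom_construction_algo(idx, net_dict):
    # Repeat the first pretrain step longer: with pretrain "repetitions": 5,
    # an offset of 6 means the first 6 * 5 = 30 sub-epochs reuse the same small network.
    idx = max(idx - 6, 0)                    # was max(idx - 3, 0)
    num_lstm_layers = idx + StartNumLayers   # idx starts at 0, so we begin with 2 layers
    # ... the real config then adds/removes LSTM layers and scales dims accordingly ...
    return net_dict

pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo}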
The above scores are on the original LibriSpeech dataset. I did run dump-dataset in the past and it seemed to run fine; I will re-confirm that. Thanks, and I will play around with the custom_construction_algo.
You can also try to set the initial learning rate (of the learning rate warmup) lower, e.g. 0.0001.
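In the config, that corresponds to the warmup line, roughly like this (learning_rate stands for the peak value already defined elsewhere in the config; 0.0001 with 20 warmup steps is what the new config in the diff below ends up using):

import numpy

learning_rate = 0.001  # placeholder for the peak learning rate defined elsewhere in the config
# learning rate warmup: ramp up linearly over the first 20 sub-epochs, starting from 0.0001
learning_rates = list(numpy.linspace(0.0001, learning_rate, num=20))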
I already tried that (lower initial learning rate) in another experiment but that did not help either.
I ran dump-dataset and everything seems fine.
To re-validate my copy of the LibriSpeech data, I ran another experiment with an older config file, and there the training error is coming down as expected. Here is the train-scores.data.
With your suggestion of increasing the number of repetitions for the first pretrain step, the training does not seem to converge either. base2.conv2l.specaug.curric3.txt train-scores.txt
Is there some glaring mistake in my config file? Or do I just need to play more with HPs?
What is the difference between your older config, and your new config? Can you post a diff? (Skip the SpecAugment part, if that is also included there.)
I am very embarrassed: I was mistakenly using the mean file of the audio features for the std_dev as well. I am redoing the experiment and hope that this solves the issue.
Unfortunately, I see no improvement after fixing my mean/std_dev bug either. Please find below a diff; < (red) is the old config and > (green) is the new config (after removing the functions for SpecAugment from the new one):
21c21
< if int(os.environ.get("DEBUG", "0")):
---
> if int(os.environ.get("RETURNN_DEBUG", "0")):
41a42
> "use_cache_manager": not debug_mode,
56a58
> 'use_new_filter': True,
175c177,178
< "source": {"class": "eval", "eval": "tf.clip_by_value(source(0), -3.0, 3.0)"},
---
> "source": {"class": "eval", "eval": "self.network.get_config().typed_value('transform')(source(0), network=self.network)"},
> "source0": {"class": "split_dims", "axis": "F", "dims": (-1, 1), "from": "source"}, # (T,40,1)
177,179c180,189
< "lstm0_fw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": 1, "from": ["source"] },
< "lstm0_bw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": -1, "from": ["source"] },
< "lstm0_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (2,), "from": ["lstm0_fw", "lstm0_bw"], "trainable": False},
---
> # Lingvo: ep.conv_filter_shapes = [(3, 3, 1, 32), (3, 3, 32, 32)], ep.conv_filter_strides = [(2, 2), (2, 2)]
> "conv0": {"class": "conv", "from": "source0", "padding": "same", "filter_size": (3, 3), "n_out": 32, "activation": None, "with_bias": True}, # (T,40,32)
> "conv0p": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1, 2), "from": "conv0"}, # (T,20,32)
> "conv1": {"class": "conv", "from": "conv0p", "padding": "same", "filter_size": (3, 3), "n_out": 32, "activation": None, "with_bias": True}, # (T,20,32)
> "conv1p": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1, 2), "from": "conv1"}, # (T,10,32)
> "conv_merged": {"class": "merge_dims", "from": "conv1p", "axes": "static"}, # (T,320)
>
> "lstm0_fw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": 1, "from": ["conv_merged"] },
> "lstm0_bw" : { "class": "rec", "unit": "nativelstm2", "n_out" : LstmDim, "direction": -1, "from": ["conv_merged"] },
> "lstm0_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (3,), "from": ["lstm0_fw", "lstm0_bw"], "trainable": False},
187c197
< "lstm2_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (2,), "from": ["lstm2_fw", "lstm2_bw"], "trainable": False},
---
> "lstm2_pool": {"class": "pool", "mode": "max", "padding": "same", "pool_size": (1,), "from": ["lstm2_fw", "lstm2_bw"], "trainable": False},
219c229
< "s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:target_embed", "prev:att"], "n_out": 1000}, # transform
---
> "s": {"class": "rec", "unit": "nativelstm2", "from": ["prev:target_embed", "prev:att"], "n_out": 1000}, # transform
255,257c265,266
< # We will first construct layer-by-layer, starting with 2 layers.
< # Initially, we will use a higher reduction factor, and at the end, we will reduce it.
< # Also, we will initially have not label smoothing.
---
> StartNumLayers = 2
> InitialDimFactor = 0.5
265,276c274,278
< num_lstm_layers = idx + 2 # idx starts at 0. start with 2 layers
< if idx == 0:
< net_dict["lstm%i_fw" % (orig_num_lstm_layers - 1)]["dropout"] = 0
< net_dict["lstm%i_bw" % (orig_num_lstm_layers - 1)]["dropout"] = 0
< if idx >= 1:
< num_lstm_layers -= 1 # repeat like idx=0, but now with dropout
< # We will start with a higher reduction factor initially, for better convergence.
< red_factor = 2 ** 5
< if num_lstm_layers == orig_num_lstm_layers + 1:
< # Use original reduction factor now.
< num_lstm_layers = orig_num_lstm_layers
< red_factor = orig_red_factor
---
> net_dict["#config"] = {}
> if idx < 4:
> net_dict["#config"]["batch_size"] = 15000
> idx = max(idx - 6, 0) # repeat first
> num_lstm_layers = idx + StartNumLayers # idx starts at 0. start with N layers
280,296c282,285
< # Use label smoothing only at the very end.
< net_dict["output"]["unit"]["output_prob"]["loss_opts"]["label_smoothing"] = 0
< # Other options during pretraining.
< if idx == 0:
< net_dict["#config"] = {"max_seq_length": {"classes": 60}}
< net_dict["#repetition"] = 10
< # Leave the last lstm layer as-is, but only modify its source.
< net_dict["lstm%i_fw" % (orig_num_lstm_layers - 1)]["from"] = ["lstm%i_pool" % (num_lstm_layers - 2)]
< net_dict["lstm%i_bw" % (orig_num_lstm_layers - 1)]["from"] = ["lstm%i_pool" % (num_lstm_layers - 2)]
< if red_factor > orig_red_factor:
< for i in range(num_lstm_layers - 2):
< net_dict["lstm%i_pool" % i]["pool_size"] = (2,)
< # Increase last pool-size to get the initial reduction factor.
< assert red_factor % (2 ** (num_lstm_layers - 2)) == 0
< last_pool_size = red_factor // (2 ** (num_lstm_layers - 2))
< # Increase last pool-size to get the same encoder-seq-length folding.
< net_dict["lstm%i_pool" % (num_lstm_layers - 2)]["pool_size"] = (last_pool_size,)
---
> if num_lstm_layers == 2:
> net_dict["lstm0_pool"]["pool_size"] = (orig_red_factor,)
> # Skip to num layers.
> net_dict["encoder"]["from"] = ["lstm%i_fw" % (num_lstm_layers - 1), "lstm%i_bw" % (num_lstm_layers - 1)]
298c287
< for i in range(num_lstm_layers - 1, orig_num_lstm_layers - 1):
---
> for i in range(num_lstm_layers, orig_num_lstm_layers):
301c290,302
< del net_dict["lstm%i_pool" % i]
---
> del net_dict["lstm%i_pool" % (i - 1)]
> # Thus we have layers 0 .. (num_lstm_layers - 1).
> layer_idxs = list(range(0, num_lstm_layers))
> layers = ["lstm%i_fw" % i for i in layer_idxs] + ["lstm%i_bw" % i for i in layer_idxs]
> grow_frac = 1.0 - float(orig_num_lstm_layers - num_lstm_layers) / (orig_num_lstm_layers - StartNumLayers)
> dim_frac = InitialDimFactor + (1.0 - InitialDimFactor) * grow_frac
> for layer in layers:
> net_dict[layer]["n_out"] = int(net_dict[layer]["n_out"] * dim_frac)
> if "dropout" in net_dict[layer]:
> net_dict[layer]["dropout"] *= dim_frac
> net_dict["enc_value"]["dims"] = (AttNumHeads, int(EncValuePerHeadDim * dim_frac * 0.5) * 2)
> # Use label smoothing only at the very end.
> net_dict["output"]["unit"]["output_prob"]["loss_opts"]["label_smoothing"] = 0
304c305
< pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo} #reduced number of reps
---
> pretrain = {"repetitions": 5, "copy_param_mode": "subset", "construction_algo": custom_construction_algo}
312a314
> accum_grad_multiple_step = 2
315c317
< stop_on_nonfinite_train_score = False
---
> #stop_on_nonfinite_train_score = False
319c321,322
< learning_rates = list(numpy.linspace(0.0003, learning_rate, num=10)) # warmup
---
> learning_rates = list(numpy.linspace(0.0001, learning_rate, num=20)) # warmup
> min_learning_rate = learning_rate / 50.
For reference, again, I think this is the new config, right? And this is the old config, I guess.
Unfortunately, I see no improvement after fixing my mean/std_dev bug either. Please find below a diff; < (red) is the old config and > (green) is the new config (after removing the functions for SpecAugment from the new one).
I'm stripping that a bit down to relevant parts.
56a58 > 'use_new_filter': True,
You might play around with this and related settings (epoch_wise_filter in the dataset). This is basically the curriculum learning, where the idea is that you only use the clean (simpler) and shorter sequences initially in training.
Did you remove parts of the diff here? Because otherwise this is wrong. E.g. max_mean_len is different, and there are also several steps now. Please double check. This is very important.
It follows the diff in the network, i.e. these changes. And then:
accum_grad_multiple_step = 2
min_learning_rate (probably minor effect)
I was surprised to see this morning that even though there was not much improvement until epoch 36, the training error suddenly dropped after epoch 37, and it seems to be decreasing as expected. train-scores.data.txt
So I am assuming the biggest problem was my mistake of using the mean file as the std_dev.
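For anyone running into the same thing, a quick sanity check on the normalization files might look like this (just a sketch; it assumes plain-text files with one value per feature dimension, so adjust the loading and the hypothetical file names to your setup):

import numpy

# Hypothetical file names; use whatever your config actually points to.
mean = numpy.loadtxt("features.mean.txt")
std_dev = numpy.loadtxt("features.std_dev.txt")

assert mean.shape == std_dev.shape, "mean and std_dev have different shapes"
assert numpy.all(std_dev > 0), "std_dev must be strictly positive"
assert not numpy.allclose(mean, std_dev), "mean and std_dev look identical -- wrong file?"
print("mean range:", mean.min(), mean.max())
print("std_dev range:", std_dev.min(), std_dev.max())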
Did you remove parts of the diff here? Because otherwise this is wrong. E.g. max_mean_len is different, and there are also several steps now. Please double check. This is very important.
In order to make the fewest possible changes to the config file, I had removed the other steps. But since the network now seems to be learning well, I will put the other steps back in and retrain with the latest config file.
Many thanks for your help.
I would recommend that in the epoch_wise_filter, you also try the multiple steps, basically with the settings like in the new config I linked. In my experiments, playing around with this was very fragile and had a huge effect on the final performance, and also on how fast it would converge, or whether it would converge at all. So if the error only goes down in epoch 37, this does not sound optimal to me. It should go down much sooner.
Yes, thanks! Now I am using the epoch_wise_filter exactly as in the config file you linked:
d["epoch_wise_filter"] = {
(1, 5): {
'use_new_filter': True,
'max_mean_len': 50, # chars
'subdirs': ['train-clean-100', 'train-clean-360']},
(5, 10): {
'use_new_filter': True,
'max_mean_len': 150, # chars
'subdirs': ['train-clean-100', 'train-clean-360']},
(11, 20): {
'use_new_filter': True,
'subdirs': ['train-clean-100', 'train-clean-360']},
}
Many thanks for the SpecAugment implementation.
Regarding the base2.conv2l.specaug.curric3 here: is my understanding correct that it is not a continued training from anything?
Also, is there a significant difference in training time for base2.conv2l.specaug.curric3 as compared to, let's say, base2?