ryanleary / mlperf-rnnt-ref

Integrate myrtle's RNN-T model code #1

Open samgd opened 4 years ago

samgd commented 4 years ago
mwawrzos commented 4 years ago

I have added the warp-transducer to the Dockerfile: https://github.com/ryanleary/mlperf-rnnt-ref/commit/99e507b5b5c616158d85bb8bc8173decc0e05d94

samgd commented 4 years ago

Status update: built the Docker container, now pre-processing LibriSpeech to wavs (it filled the disk last night so restarting).

Once pre-processed, will attempt to overfit on dev-clean.

Let me know if you're working on this and changing the interfaces @mwawrzos

samgd commented 4 years ago

Have hooked everything up, however the loss function is unable to import the GPU version. This can be replicated by building the container and running:

import torch
from warprnnt_pytorch import RNNTLoss
rnnt_loss = RNNTLoss()
cuda = True
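# acts: (batch, max_time, max_label_length + 1, vocab) activations expected by the transducer loss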
acts = torch.FloatTensor([[[[0.1, 0.6, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.6, 0.1, 0.1],
                            [0.1, 0.1, 0.2, 0.8, 0.1]],
                           [[0.1, 0.6, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.2, 0.1, 0.1],
                            [0.7, 0.1, 0.2, 0.1, 0.1]]]])
labels = torch.IntTensor([[1, 2]])
act_length = torch.IntTensor([2])
label_length = torch.IntTensor([2])
acts = torch.autograd.Variable(acts, requires_grad=True)
labels = torch.autograd.Variable(labels)
act_length = torch.autograd.Variable(act_length)
label_length = torch.autograd.Variable(label_length)
if cuda: 
    acts = acts.cuda()
    labels = labels.cuda()
    act_length = act_length.cuda()
    label_length = label_length.cuda()

loss = rnnt_loss(acts, labels, act_length, label_length)
loss.backward()
samgd commented 4 years ago

Installing whilst inside a running container fixes the issue. Need to find a way to do it during docker build!

samgd commented 4 years ago

Update: training loop fixed up, fixing up eval() function.

samgd commented 4 years ago

Update: Training and eval loops fixed up, script/train.sh now running. Attempting to overfit to dev-clean.

Major change log: frame splicing now downsamples by frame_splicing in addition to stacking frame_splicing adjacent frames. Downsampling across the time dimension is a necessity due to the large activation that gets fed to the loss function.
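
For reference, a minimal sketch of this stack-and-downsample splicing, assuming a (batch, time, features) layout; the function name and sizes are for illustration only, not the repo's actual implementation:

import torch

def splice_frames(x: torch.Tensor, frame_splicing: int) -> torch.Tensor:
    # Stack frame_splicing adjacent frames and downsample time by the same factor:
    # (batch, time, feats) -> (batch, time // frame_splicing, feats * frame_splicing)
    b, t, f = x.shape
    t = (t // frame_splicing) * frame_splicing  # drop any trailing frames
    return x[:, :t, :].reshape(b, t // frame_splicing, f * frame_splicing)

feats = torch.randn(8, 100, 64)
print(splice_frames(feats, 3).shape)  # torch.Size([8, 33, 192])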

samgd commented 4 years ago

Changes pushed. ~70s/epoch, so leaving the dev-clean overfit run overnight; will update with results tomorrow and pick up any changes you've made.

samgd commented 4 years ago

Full dev-clean overfit run using the existing training setup and hyperparameters failed. However, changing the settings/hyperparameters back to a simpler training setup succeeds: downsampling by a factor of 2 rather than 3, no weight decay, no spec augment, a reduced and constant lr (working at batch size 8, so lr should probably be reduced anyway), and a single dev-clean batch (8 samples).

Now changing back to the whole of dev-clean, will update with results.

Assuming that works, an initial training run can be done on the actual training set and hopefully reproduce the existing baseline. From there we're in a good spot to begin iterating!

samgd commented 4 years ago

Full dev-clean overfit appears to be working:

Prediction: bartley leaned his head in his hands there it is them
Reference: bartley leaned his head in his hands and spoke through his teeth
Loss@Step: 40600  ::::::: 10.253323554992676
Step time: 2.5285778045654297 seconds
training_batch_WER: 0.25984251968503935
Prediction: in the second i tel what is understod of her on earth here my lady is so isn
Reference: in the second i tell what is understood of her on earth here my lady is desired
Loss@Step: 40625  ::::::: 7.737914085388184
Step time: 2.0687413215637207 seconds
training_batch_WER: 0.2934131736526946
Prediction: it's surely a turble storm outside said the merchant's eldest daughter as the wind ratled the tiles of the rof and the rain beat in torents and to
Reference: it's surely a terrible storm outside said the merchant's eldest daughter as the wind rattled the tiles of the roof and the rain beat in torrents against the doors and windows
Loss@Step: 40650  ::::::: 12.093977928161621
Step time: 2.5042574405670166 seconds
training_batch_WER: 0.31718061674008813
Prediction: yet as the profound doctor had ben terified at his own rashnes apolinaris was heard to muter some faint acents of excuse and explanation
Reference: yet as the profound doctor had been terrified at his own rashness apollinaris was heard to mutter some faint accents of excuse and explanation
Loss@Step: 40675  ::::::: 18.001619338989258
Step time: 3.5752782821655273 seconds
Finished epoch 125 in 78.1238625049591
Starting epoch 126, step 40698

Starting a run on train-* shortly.
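
For reference, the training_batch_WER and evaluation WER figures in these logs are word-level edit distance divided by the number of reference words (the batch figures aggregate this over the batch). A minimal sketch of the per-utterance metric, not necessarily the repo's exact implementation, applied to the first prediction/reference pair above:

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("bartley leaned his head in his hands and spoke through his teeth",
                      "bartley leaned his head in his hands there it is them"))
# 5 word errors over 12 reference words, roughly 0.417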

mwawrzos commented 4 years ago

I ran the default config on 8 GPUs and in my case the overfit did not appear. Unfortunately, I have not run it on a single GPU, so I can't be sure whether the problem is my config or the multi-GPU setup.

I also ran training on train-clean-100-wav and it ran out of memory. It would not have converged anyway, since the overfit had not appeared, but I had not expected extra memory requirements for a larger dataset.

samgd commented 4 years ago

OK - will push my changes shortly.

Current training log file: https://f000.backblazeb2.com/file/samgd-shared/rnnt_train_fp16_gbs8.191128103620.log

mwawrzos commented 4 years ago

@samgd, you said that eval is fixed up. Do you have it internally, or can I work on it tomorrow?

samgd commented 4 years ago

Changes pushed here:

https://github.com/ryanleary/mlperf-rnnt-ref/tree/integration_myrtle

Diff the changes; many are crucial: downsampling of the input time steps for memory, the learning rate due to the batch size.

You'll probably want to set encoder_rnn_layers=2

samgd commented 4 years ago

Also can you help with this?

https://github.com/NVIDIA/DeepLearningExamples/issues/322

I'm confused as to what splice_frames was actually meant to do

mwawrzos commented 4 years ago

Thanks for sharing your code. I passed the wrong value to the decode function; that is the reason I had 100% WER all the time, even though the loss was at a similar level.

samgd commented 4 years ago

Got it. Let me know how your experiments go; it would be good to start scaling up as soon as possible!

samgd commented 4 years ago

The log file above was using the old, incorrect splicing, so half of the input frames were missing. Already seeing better results on a new training run before one epoch has even finished.

samgd commented 4 years ago

Results from overnight (for future reference, these WERs include the bug whereby duplicate symbols were still being merged à la CTC; a sketch of that merging is at the end of this comment):

==========>>>>>>Evaluation WER: 0.3795632513510533
==========>>>>>>Evaluation WER: 0.3218631667953384
==========>>>>>>Evaluation WER: 0.29259953678173595
==========>>>>>>Evaluation WER: 0.27934634756075144
==========>>>>>>Evaluation WER: 0.26953053196573656
==========>>>>>>Evaluation WER: 0.263041799933826
==========>>>>>>Evaluation WER: 0.25570751075328113
==========>>>>>>Evaluation WER: 0.2537039079445609
==========>>>>>>Evaluation WER: 0.24745413771552516

This is in line with the previous training run from the old repo (although it's been less than a day, so it may be too early to say). Will leave it running but will move on to other tasks.
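
For context, the merging bug mentioned above: a CTC-style collapse removes repeated adjacent labels, which is wrong for RNN-T output, where a genuinely repeated label (e.g. a double letter) is a separate emission. A minimal sketch with made-up token ids, not the repo's actual decoder:

BLANK = 0

def collapse_ctc_style(tokens):
    # CTC-style: merge adjacent duplicates, then drop blanks.
    merged = []
    for t in tokens:
        if not merged or t != merged[-1]:
            merged.append(t)
    return [t for t in merged if t != BLANK]

def collapse_rnnt_style(tokens):
    # RNN-T-style: only drop blanks; repeated non-blank labels are kept.
    return [t for t in tokens if t != BLANK]

hyp = [5, 5, 0, 3, 3]             # the model genuinely emits 5 twice, blank, then 3 twice
print(collapse_ctc_style(hyp))    # [5, 3]        repeats wrongly merged
print(collapse_rnnt_style(hyp))   # [5, 5, 3, 3]  repeats preserved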

mwawrzos commented 4 years ago

I am preparing an environment for param search (scripts, etc.). If you think that something needs to be done first, please let me know.

samgd commented 4 years ago

What parameters are you planning on searching over?

mwawrzos commented 4 years ago

Currently, I am waiting for the reproduction experiments to complete. In the meantime, I am setting up the tools to be ready for the next tasks.

I was considering the following experiments:

  1. grid search over model sizes (batch size=1) to find the max model that fits into memory
  2. random search on the found models across:
    • dropout
    • learning rate
    • gradient accumulation steps (maybe, to simulate bigger batches; see the sketch below)
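
For the gradient accumulation point, a minimal sketch of the usual pattern; the linear model and MSE loss here are stand-ins for the RNN-T model and transducer loss, not the repo's training loop:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the RNN-T model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()                         # stand-in for the transducer loss
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4                                # simulate a 4x larger effective batch
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the summed gradient matches a big batch
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update once per accum_steps micro-batches
        optimizer.zero_grad()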

I am assuming that the input params, like the number of features, should not be changed.

However, before getting into the search, I would like to discuss the shape of the network. For example, the encoder net has nn.Linear layers around the LSTM stack. I don't see them in the paper, but you may have reasons for that. That is a discussion for later though; for now, the model should stay unchanged until we replicate the previous results on this codebase.
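
For reference, the kind of shape being discussed, i.e. an LSTM stack wrapped in nn.Linear projections; the layer sizes here are illustrative only and do not reflect the repo's config:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Illustrative only: Linear -> LSTM stack -> Linear.
    def __init__(self, in_feats=240, hidden=1024, out_feats=640, layers=2):
        super().__init__()
        self.pre = nn.Linear(in_feats, hidden)    # projection before the LSTMs
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.post = nn.Linear(hidden, out_feats)  # projection after the LSTMs

    def forward(self, x):                         # x: (batch, time, in_feats)
        x = self.pre(x)
        x, _ = self.lstm(x)
        return self.post(x)

print(Encoder()(torch.randn(8, 50, 240)).shape)   # torch.Size([8, 50, 640])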

samgd commented 4 years ago

OK, post results from the current reproduction experiments when they're ready as I'm interested to see :-) (WER will need recomputing, see below)

My view is that it's a little early to do a grid search and that the focus should be on driving the architecture nearer to its final form. For instance, I'm now able to increase the batch size by 2x by downsampling after the second LSTM layer. However, if there is spare compute time then more statistics could be useful.
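
A minimal sketch of that kind of time reduction between encoder LSTM layers; the reshape-based stacking, the factor of 2, and the sizes are for illustration, not the repo's implementation:

import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    # Halve the time dimension by stacking pairs of adjacent frames.
    def __init__(self, factor=2):
        super().__init__()
        self.factor = factor

    def forward(self, x):                        # x: (batch, time, hidden)
        b, t, h = x.shape
        t = (t // self.factor) * self.factor     # drop any trailing frame
        return x[:, :t, :].reshape(b, t // self.factor, h * self.factor)

# e.g. placed after the second LSTM layer so later layers (and the loss) see half the time steps
reduce = TimeReduction(2)
print(reduce(torch.randn(8, 101, 512)).shape)    # torch.Size([8, 50, 1024])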

Will push more code soon, still iterating on it, but this commit will be useful to cherry-pick in before running the experiments:

https://github.com/ryanleary/mlperf-rnnt-ref/commit/4127f6305ad38d0674fd37184563c111ffa8fcca

mwawrzos commented 4 years ago

Good point! I will update my code with your fix.

samgd commented 4 years ago

Replication run OK.

Commit: https://github.com/ryanleary/mlperf-rnnt-ref/commit/5bf5903a808e0de308c5417a660aa8992e5e5cce

Logs and state dicts: https://myrtle-public-models.s3-us-west-2.amazonaws.com/rnnt_train_fp16_gbs8.191129174845.tar.gz

Results summary:

==========>>>>>>Evaluation WER: 0.3166795338406676
==========>>>>>>Evaluation WER: 0.25206793867872507
==========>>>>>>Evaluation WER: 0.22184846145362302
==========>>>>>>Evaluation WER: 0.20771295172971582
==========>>>>>>Evaluation WER: 0.19718025072607626
==========>>>>>>Evaluation WER: 0.18798941215396492
==========>>>>>>Evaluation WER: 0.18273225249071726
==========>>>>>>Evaluation WER: 0.17488327635013418
==========>>>>>>Evaluation WER: 0.1731921620528657
==========>>>>>>Evaluation WER: 0.16966288004117497
==========>>>>>>Evaluation WER: 0.16519613249512885
==========>>>>>>Evaluation WER: 0.16696077350097424
==========>>>>>>Evaluation WER: 0.1635969265835815
==========>>>>>>Evaluation WER: 0.15708981287452667

Note that the previous run took ~50 epochs to get to ~13% WER. A small-width beam search dropped that by ~2% to ~11% WER, so although this is higher it is consistent.

mwawrzos commented 4 years ago

Batch size = 8 was not fitting into memory on a 16GB V100, so I have results for BS=4 instead. The model seems to be more sensitive to the LR than I expected. I tried two LRs (3e-3 * numGPUS and 1.5e-4 * numGPUS) and both the Adam and Novograd optimizers. I ran experiments for overfitting, for training on the small 100h dataset, and for training on the full dataset. Cases not mentioned either finished with OOM, finished with WER close to 100%, or are still too early to judge.

I am attaching logs, so you can check if my config is in sync with yours.

Here are my best results:

GPUs  dataset          train WER  eval WER  LR       optimizer  epoch  logs
8     dev-clean         7.04%     11.51%    0.0024   novograd    2999  1.log
8     dev-clean        49.30%     15.48%    0.0012   novograd    2999  2.log
1     dev-clean         3.17%      7.95%    0.00015  adam         159  3.log
1     dev-clean        68.57%     68.76%    0.0003   adam         206  4.log
8     train-clean-100  27.55%     34.01%    0.0012   novograd     302  5.log
8     full             65.49%     40.83%    0.0012   novograd      31  6.log
1     train-clean-100  27.01%     32.80%    0.00015  adam          15  7.log
1     full             35.17%     23.06%    0.00015  adam           1  8.log

Note that not all results are for the same epoch. I am posting the latest available.