mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Training performance drop after augmentation refactoring #3092

Open · DanBmh opened this issue 4 years ago

DanBmh commented 4 years ago

I have the feeling that the training performance (both duration and accuracy) got worse after the augmentation refactoring commits.

Before, training on my dataset took about 2:10 h; today it took 3:20 h with about the same number of epochs. At the beginning an epoch takes about 5 min, but after epoch 18 epochs suddenly need 8 min on average. I didn't see this behaviour in the trainings from two days ago.

The accuracy also got a bit worse:

| Dataset | Additional Infos | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge | | Test: 32.844025, Validation: 36.912005 | 14 | WER: 0.240091, CER: 0.087971 |
| Voxforge | without freq_and_time_masking augmentation | Test: 33.698494, Validation: 38.071722 | 10 | WER: 0.244600, CER: 0.094577 |
| Voxforge | using new audio augmentation options (AUG_AUDIO code below) | Test: 29.280865, Validation: 33.294815 | 21 | WER: 0.220538, CER: 0.079463 |
| Voxforge | after refactoring | Test: 33.317413, Validation: 38.678969 | 20 | WER: 0.243480, CER: 0.088640 |

These were the options I set before:

AUG_PITCH_TEMPO="--augmentation_pitch_and_tempo_scaling \
                   --augmentation_pitch_and_tempo_scaling_min_pitch 0.98 \
                   --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
                   --augmentation_pitch_and_tempo_scaling_max_tempo 1.2"
AUG_ADD_DROP="--data_aug_features_additive 0.2 \
                --augmentation_spec_dropout_keeprate 0.95"
AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
AUG_AUDIO="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
      --augment gaps[p=0.05,n=1:3~2,size=10:100] \
      --augment resample[p=0.1,rate=12000:8000~4000] \
      --augment codec[p=0.1,bitrate=48000:16000] \
      --augment volume[p=0.1,dbfs=-10:-40]"
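For context, this is roughly how the AUG_* variables get expanded into the training call. This is a minimal sketch only: the data paths, batch size, epoch count and checkpoint dir are placeholders, not my exact command.

    # Sketch only: paths and values below are placeholders.
    # The AUG_* variables are appended unquoted so every --augment /
    # --augmentation_* flag inside them is passed through to the trainer.
    python3 DeepSpeech.py \
      --train_files data/train.csv \
      --dev_files data/dev.csv \
      --test_files data/test.csv \
      --train_batch_size 24 \
      --epochs 1000 \
      --checkpoint_dir checkpoints/voxforge \
      $AUG_PITCH_TEMPO $AUG_ADD_DROP $AUG_FREQ_TIME $AUG_AUDIO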

And these are the options I used for today's run:

    AUG_AUDIO="--augment volume[p=0.1,dbfs=-10:-40] \
      --augment pitch[p=0.1,pitch=1.1~0.95] \
      --augment tempo[p=0.1,factor=1.25~0.75]"
    AUG_ADD_DROP="--augment dropout[p=0.1,rate=0.05] \
      --augment add[p=0.1,domain=signal,stddev=0~0.5]"
    AUG_FREQ_TIME="--augment frequency_mask[p=0.1,n=1:3,size=1:5] \
      --augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]"
    AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
      --augment resample[p=0.1,rate=12000:8000~4000] \
      --augment codec[p=0.1,bitrate=48000:16000]"
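For anyone comparing the two setups, this is how I read the new value syntax (I may well be misreading the docs, so please correct me): `p` is the probability of the augmentation being applied to a sample, `start:stop` makes a value move from start to stop over the course of training, and `~radius` adds a random offset around the current value. As an annotated example:

    # My reading of one spec from above (not authoritative):
    #   p=0.1           -> applied to roughly 10% of samples
    #   domain=signal   -> applied to the audio signal, not the spectrogram
    #   n=3:10~2        -> number of masks goes from 3 to 10 over training, randomized by +/-2
    #   size=50:100~40  -> mask size goes from 50 to 100 over training, randomized by +/-40
    --augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]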

I also had to reduce the batch size from 30 to 24 because I got error #3088. About two months ago I could use 36 without any problems.

I know there is a bit of randomness in the accuracy, and I did change some of the augmentation params slightly, but the change in results is bigger than expected.

@tilmankamp do you have an idea about this?

DanBmh commented 4 years ago

I ran another test with the code from directly before the refactoring (commit 188a6f2c1ee53dc79acf8abceaf729b5f9a05e7a).

This time one epoch took about 4 min on average and the whole training took 1:45 h.

| Dataset | Additional Infos | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge | | Test: 28.846869, Validation: 32.680268 | 16 | WER: 0.225360, CER: 0.083504 |

I now used a batch size of 24 and updated the params again to better match the params above:

  AUG_AUDIO="--augmentation_pitch_and_tempo_scaling \
                   --augmentation_pitch_and_tempo_scaling_min_pitch 0.95 \
                   --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
                   --augmentation_pitch_and_tempo_scaling_max_tempo 1.25"
  AUG_ADD_DROP="--data_aug_features_additive 0.25 \
                --augmentation_spec_dropout_keeprate 0.95"
  AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
  AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
      --augment gaps[p=0.05,n=1:3~2,size=10:100] \
      --augment resample[p=0.1,rate=12000:8000~4000] \
      --augment codec[p=0.1,bitrate=48000:16000] \
      --augment volume[p=0.1,dbfs=-10:-40]"
tilmankamp commented 4 years ago

@DanBmh The augmentations volume, gaps, reverb, codec, resample and overlay are most likely not responsible for this discrepancy, as their implementations have not been changed during the refactoring. For the others it'd be helpful to compare them one by one with their former implementations to get a better understanding of the problem. I'll do some performance tests here.

tilmankamp commented 4 years ago

My observations so far:

DanBmh commented 4 years ago

I had some time to run some more tests today (with master from about two days ago).

This time an epoch took about 4:30 min on average. I also tried different dropout values:

DanBmh commented 4 years ago

@tilmankamp Any updates on the accuracy problem?

JRMeyer commented 3 years ago

@DanBmh -- did you ever reach a conclusion on this? Have you been running augmentation with newer releases?

DanBmh commented 3 years ago

Were there important changes to the augmentations in between? I didn't check for it.

I didn't run further tests, just the ones above. For my own trainings I still use the old version.

DanBmh commented 3 years ago

I might have found a reason for the accuracy problem. First, I misunderstood the augmentation flag description, so the pitch and tempo flags were not converted correctly. Second, the new start:stop logic could be another reason. I normally use a high epoch number like 1000, because training is stopped by early stopping. But I assume that the :stop value is tied to the epochs flag, so I'm effectively using only the start values for the augmentations instead of the full range.
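To illustrate the second point (my understanding of the behaviour, not verified against the code), with a hypothetical dropout range:

    # Hypothetical example: with --epochs 1000 the start:stop transition is
    # stretched over all 1000 epochs, so if early stopping ends training
    # around epoch 20 the rate has barely moved away from 0.05:
    --augment dropout[p=0.1,rate=0.05:0.3]

    # Possible workaround until this is clearer: use a single constant value
    # so early stopping can't cut the schedule short:
    --augment dropout[p=0.1,rate=0.05]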

I will try to run a test soon, but I don't believe this will also solve the slower training.

For the second problem, maybe a new flag like augment_growth_epochs could be helpful for better combination with early stopping.

reuben commented 3 years ago

> For the second problem, maybe a new flag like augment_growth_epochs could be helpful for better combination with early stopping.

Yeah, that could be useful. Usually for hyperparameter schedules there's a separate start/ramp-up/ramp-down/stop range, distinct from the number of steps/epochs of the whole training run.
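To sketch what that could look like here (purely hypothetical, nothing like this exists in the flags today): the augmentation schedule would get its own length, independent of --epochs and of wherever early stopping ends the run.

    # Hypothetical flag, not implemented: finish every start:stop ramp within
    # the first 30 epochs and then hold the stop values, no matter how large
    # --epochs is or when early stopping kicks in.
    --augment_growth_epochs 30
    --augment frequency_mask[p=0.1,n=1:3,size=1:5]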