mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Allow weighting training data #252

Closed Cwiiis closed 7 years ago

Cwiiis commented 7 years ago

It may be advantageous to weight training data so that shorter samples are seen more frequently. This may allow us to progress faster during the early stages of training.

I think the easiest way of doing this would be to alter the list of text files in the importers so that a sample's weight is represented by how often it appears in the dataset. I'm uncertain what the best distribution curve would be, so I suggest we allow defining a distribution function (with the default being a linear distribution).

Let's assume txt_files is the list of text files (from which the wave files are derived), and alpha_cb is a callback that takes a float representing an iterator's progress through txt_files and returns a float representing the distribution at that point in the curve. So at the short end, where the input would be 0, the output might be 2, meaning the sample should appear twice in the list; at the long end, where the input would be 1, the output might be 1, meaning the sample should appear once. Some pseudo-code:

# Build the oversampled list: alpha_cb(progress) gives the (possibly
# fractional) number of times each sample should appear, and the error
# accumulator carries the fractional part over to later samples.
new_txt_files = []
n_files = len(txt_files)
error = 0.0
for i in range(n_files):
  error += alpha_cb(i / float(n_files))
  while error >= 1:
    new_txt_files.append(txt_files[i])  # append the sample itself, not its characters
    error -= 1
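
For illustration, here's a hedged, runnable sketch that wraps the snippet above into a function and feeds it a hypothetical linear alpha_cb running from 2.0 at the short end down to 1.0 at the long end (the 2.0 endpoint, the function names, and the assumption that txt_files is sorted shortest-to-longest are all illustrative, not anything in the importers):

# Sketch only: oversample a duration-sorted sample list with a
# hypothetical linear weighting callback. All names/values are assumptions.
def linear_alpha_cb(progress, start=2.0, end=1.0):
  # progress runs from 0.0 (shortest sample) to 1.0 (longest sample)
  return start + (end - start) * progress

def oversample(txt_files, alpha_cb):
  new_txt_files = []
  error = 0.0
  n_files = len(txt_files)
  for i in range(n_files):
    error += alpha_cb(i / float(n_files))
    while error >= 1:
      new_txt_files.append(txt_files[i])
      error -= 1
  return new_txt_files

# Toy usage: samples near the start of the (sorted) list appear roughly twice.
print(oversample(['a.txt', 'b.txt', 'c.txt', 'd.txt'], linear_alpha_cb))
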
kdavis-mozilla commented 7 years ago

A few statements:

A few questions:

I have a few more comments, but I guess that's enough to get things started.

Cwiiis commented 7 years ago

Ah, thanks for that, I thought txt_files was already sorted - that's good to know :) I would say this function only gets applied to the training set, so as not to affect validation/test WER. This shouldn't affect the 'correctness' of the test WER, I suppose, in that it affects what the test set is (though maybe that should be considered wrong - I'm not sure really). I don't know that this will improve anything; part of this would be running tests with different parameters to see what effect it has. Maybe it's a dead end (at least how I envisage it), but I think it's worth finding out. Not sure how to choose alpha_cb - I'm going with a linear scale, with 2.0 at one end, to begin with, but I think a graph of its effect against loss over X epochs might be interesting. It might also be worth trying different levels of exponential drop-off in the curve (see the sketch below). I'm not familiar with AdaBoost, I'll have a look - thanks!
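
For example, an exponential drop-off callback might look something like this; a hedged sketch only, where the peak weight and decay rate are made-up knobs to sweep, not anything that exists in the code:

import math

# Sketch of an exponential drop-off weighting: heavy repetition of the
# shortest samples, decaying towards 1.0 for the longest ones. peak and
# decay are illustrative knobs, not anything in the importers.
def exp_alpha_cb(progress, peak=2.0, decay=3.0):
  # progress in [0, 1]; weight decays from `peak` towards 1.0
  return 1.0 + (peak - 1.0) * math.exp(-decay * progress)

print([round(exp_alpha_cb(p / 10.0), 2) for p in range(11)])
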

kdavis-mozilla commented 7 years ago

It won't affect validation and test WER, but while training we compare training and validation WER, and it will affect this comparison. It will also affect the training WER we currently report. This will cause older results not to be comparable to these results, as they will be on different data sets.

To do an apples-to-apples comparison, what might be more apropos is to use such a trick for training, then, when reporting the WER for the training data set, calculate the WER with the unaltered training set.

In looking at AdaBoost, don't think too much about the weak learners; think more about the data re-weighting.
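
As a toy, self-contained illustration of why the two numbers differ (the data and helpers below are made up and have nothing to do with the importer code), the corpus WER over an oversampled list and over the unaltered list are genuinely different quantities:

# Toy example: one short sample (transcribed correctly) that the weighting
# duplicates, and one long sample (transcribed badly) that appears once.
def edit_distance(ref_words, hyp_words):
  # Standard word-level Levenshtein distance.
  prev = list(range(len(hyp_words) + 1))
  for i, r in enumerate(ref_words, 1):
    cur = [i]
    for j, h in enumerate(hyp_words, 1):
      cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
    prev = cur
  return prev[-1]

def corpus_wer(pairs):
  # pairs: list of (reference, hypothesis) transcript strings
  errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
  words = sum(len(r.split()) for r, _ in pairs)
  return errors / float(words)

unaltered = [('hello world', 'hello world'),
             ('the quick brown fox', 'the quick brown socks')]
weighted = [unaltered[0], unaltered[0], unaltered[1]]  # short sample repeated

print('WER over weighted list:  %.3f' % corpus_wer(weighted))   # 1 error / 8 words
print('WER over unaltered list: %.3f' % corpus_wer(unaltered))  # 1 error / 6 words
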

Cwiiis commented 7 years ago

I'm not sure that makes sense - yes, you won't be able to compare weighted training WER with unweighted, but the only reason to look at training WER is to see the effect of the training dataset and to check for over-fitting. So it makes sense to use the exact same dataset that was used in training when calculating the WER (at least, I think); otherwise you'll probably get an artificially lower WER.

The comparison of validation and training WER will be affected, but I don't think the meaning of it will be affected. This does mean that old training WER and weighted training WER don't measure the exact same thing, but they are still comparable factors in the "this is how effective training has been in learning the training dataset" measure.

There may be a better way to weight it without altering the dataset though that would address these concerns. I'm doing the easiest/most obvious thing first to discover if it's likely to yield positive results. Reading about AdaBoost now.

kdavis-mozilla commented 7 years ago

One of the reasons to look at training and validation WER while training is to look for overfitting. With a skewed training set, the training WER is not comparable with the validation WER, and thus overfitting cannot be measured while training.

As an example, say we have 1 billion training examples; we weight example 0 with weight 1 and all others with weight 0. Our model learns example 0 perfectly and our training WER is 0.0. We run the same model over our validation set and, unsurprisingly, get a WER of 1.0.

Naively, looking only at the WERs, one would think the model is overfitting the training set, as the training WER is 0.0. However, because of the weighting, this is false: it's overfitting only a small subset of the training set.

Thus, as this simple example illustrates, the WER for a weighted training set can't be used for checking overfitting.

To get a meaningful result one has to calculate the training WER using an unweighted training set.

Cwiiis commented 7 years ago

In that naive example we have overfit, though - with those weights, the training model has become as closely fit as it ever will be. Let's say you have 100 samples and you weight the first 50 samples at 2 and the next 50 at 1 - if it learns the first 50 perfectly and doesn't learn the next 50, you'll end up with a WER of 0.66 with the weighted training set, or 0.5 if you unweight it. Contrary to what you suggest, if you calculated the WER on the unweighted training set, you're likely to under-report WER and end up not catching overfitting. In your example, we would never over-fit.

kdavis-mozilla commented 7 years ago

In your example, we would never over-fit.

Yes, that's part of my point. The weighting skews the WER value for the weighted training data set, and makes that WER useless for comparisons.

You have to draw i.i.d. from the training data set and calculate the WER from such a sample, or your sample is biased and will lead to incorrect conclusions.

To understand this a bit more, maybe you can read about sampling bias and, in particular, selection bias.
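
Concretely, "draw i.i.d." here could be as simple as sampling uniformly with replacement from the unaltered training list and scoring that sample; a minimal sketch, where score_sample is just a stand-in for decoding one sample and computing its WER:

import random

# Sketch: estimate training WER from an unbiased (i.i.d. uniform) sample
# of the unaltered training list, not from the oversampled one.
def score_sample(sample):
  return 0.0  # placeholder for decoding `sample` and computing its WER

def estimate_training_wer(train_samples, n=1000, seed=0):
  rng = random.Random(seed)
  drawn = [rng.choice(train_samples) for _ in range(n)]  # with replacement -> i.i.d.
  return sum(score_sample(s) for s in drawn) / float(n)  # mean per-sample WER
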

Cwiiis commented 7 years ago

I kind of meant that in your example, if we do what you suggest, we would never over-fit. Does what you're saying still hold? I do understand what sampling and selection bias are :) I thought the idea of weighting (in this case) was indeed to add some small bias.

kdavis-mozilla commented 7 years ago

Right, in my example we would never overfit. But we would never know this if we used your suggested weighted WER calculation, as the WER calculation using the weighted training data would be 0.0.

The idea of weighting is to add bias in training not in the evaluation of the WER on the training set.

kdavis-mozilla commented 7 years ago

Also, other than "it seems like a good idea", you have to justify why you weight some samples more than others. As the space of weighting functions is infinite-dimensional, guessing functions without a theoretical basis isn't really an efficient way of proceeding.

If you were to create some algorithm, like AdaBoost, that weighted "failed" examples more, that would likely have a much higher chance of working.
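
Very roughly, and purely as a sketch (this is not AdaBoost itself, and none of these names exist in the repo), the direction would be something like repeating the samples the model currently gets wrong more often in the next epoch's list:

# Sketch: after each epoch, repeat the samples the model got wrong more
# often in the next epoch's list. per_sample_wer is a stand-in for
# decoding each sample and measuring its WER; everything here is illustrative.
def reweight_by_error(samples, per_sample_wer, max_repeats=4):
  next_epoch = []
  for sample in samples:
    wer = per_sample_wer(sample)           # 0.0 = perfect, 1.0 = all wrong
    repeats = 1 + int(round(wer * (max_repeats - 1)))
    next_epoch.extend([sample] * repeats)  # harder samples appear more often
  return next_epoch

# Toy usage with a fake per-sample WER lookup.
fake_wer = {'easy.wav': 0.0, 'medium.wav': 0.4, 'hard.wav': 1.0}
print(reweight_by_error(list(fake_wer), fake_wer.get))
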

Cwiiis commented 7 years ago

Agreed about 'it seems like a good idea' as justification :) I'm having a look at doing something along the lines of AdaBoost, though of course we'll still need to verify its efficacy.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.