A few statements:

txt_files is not ordered by wav file size, but _files_circular_list is.

A few questions:

How do you plan to choose alpha_cb? (The space of possible functions is infinite dimensional.)

I have a few more comments, but I guess that's enough to get things started.
Ah, thanks for that, I thought txt_files was already sorted - that's good to know :)

I would say this function only gets applied to the training set, to not affect validation/test WER. This shouldn't affect the 'correctness' of the test WER I suppose, in that it affects what the test set is (but maybe that should be considered wrong - I'm not sure really).

I don't know that this will improve anything; part of this would be running tests with different parameters to see what effect it has. Maybe it's a dead-end (at least how I envisage it), but I think it's worth finding out.

Not sure how to choose alpha_cb - I'm going with a linear scale, with 2.0 at one end to begin with, but I think a graph of its effect against loss over X epochs might be interesting. Might be worth trying different levels of exponential drop-off in the curve.

I'm not familiar with AdaBoost, I'll have a look - thanks!
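For illustration, a couple of candidate shapes for alpha_cb along those lines could look like the following - a sketch only; the 2.0 starting value, the function names, and the decay rate are assumptions to be tuned, not settled choices:

```python
import math

def linear_alpha_cb(progress, start=2.0, end=1.0):
    # Straight line from 'start' repetitions at the shortest samples (progress 0)
    # down to 'end' repetitions at the longest (progress 1).
    return start + (end - start) * progress

def exponential_alpha_cb(progress, start=2.0, end=1.0, rate=3.0):
    # Exponential drop-off from 'start' towards 'end'; 'rate' controls how
    # sharply the extra weight on short samples falls away.
    return end + (start - end) * math.exp(-rate * progress)
```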
It won't affect validation and test WER, but while training we compare training and validation WER. It will affect this comparison, and also the training WER reporting we currently have. This will cause the older results not to be comparable to these results, as they will be on different data sets.
To do an apples-to-apples comparison, what might be more apropos is to use such a trick for training, then, when reporting the WER for the training data set, calculate the WER with the unaltered training set.
In looking at AdaBoost don't think too much about the weak learners, think more about the data re-weighting.
I'm not sure that makes sense - yes, you won't be able to compare weighted training WER with unweighted, but the only reason to look at training WER is to see the effect of the training dataset and to check for over-fitting. So it makes sense to use the exact same dataset as was used in training when calculating the WER (at least, I think); otherwise you'll probably get an artificially lower WER.
The comparison of validation and training WER will be affected, but I don't think the meaning of it will be affected. This does mean that old training WER and weighted training WER don't measure exactly the same thing, but they are still comparable as factors in the "this is how effective training has been at learning the training dataset" measure.
There may be a better way to weight it, without altering the dataset, that would address these concerns. I'm doing the easiest/most obvious thing first to discover whether it's likely to yield positive results. Reading about AdaBoost now.
One of the reasons to look at training and validation WER while training is to check for overfitting. With a skewed training set, the training WER is not comparable with the validation WER, and thus overfitting cannot be measured while training.
As an example, say we have 1 billion training examples; we weight example 0 with weight 1 and all others with weight 0. Our model learns example 0 perfectly and our training WER is 0.0. We run the same model over our validation set and, unsurprisingly, get a WER of 1.0.
Naively, looking only at the WERs, one would think the model is overfitting the training set, as the training WER is 0.0. However, because of the weighting, this is false. It's overfitting only a small subset of the training set.
Thus, as this simple example illustrates, the WER for a weighted training set can't be used for checking overfitting.
To get a meaningful result, one has to calculate the training WER using an unweighted training set.
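To make the arithmetic of that example concrete, here's a toy version, scaled down to 10 examples and treating WER as a simple per-sample error rate purely for illustration:

```python
def error_rate(errors, weights):
    # errors[i] is 1 if the model gets example i wrong, 0 if it gets it right;
    # weights[i] is how heavily example i counts in the training set.
    return sum(e * w for e, w in zip(errors, weights)) / sum(weights)

# Example 0 is learned perfectly, everything else is wrong.
errors  = [0] + [1] * 9
weights = [1] + [0] * 9        # only example 0 carries any weight

print(error_rate(errors, weights))      # 0.0 - the weighted training "WER" looks perfect
print(error_rate(errors, [1] * 10))     # 0.9 - the unweighted training error tells the real story
```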
In that naive example we have overfit, though - with those weights, the training model has become as closely fit as it will ever be. Let's say you have 100 samples and you weight the first 50 samples at 2 and the next 50 at 1 - if it learns the first 50 perfectly and doesn't learn the next 50, you'll end up with a WER of 0.66 on the weighted training set, or 0.5 if you unweight it. Contrary to what you suggest, if you calculate the WER on the unweighted training set, you're likely to under-report WER and end up not catching overfitting. In your example, we would never over-fit.
In your example, we would never over-fit.
Yes, that's part of my point. The weighting skews the WER value for the weighted training data set, and makes that WER useless for comparisons.
You have to draw i.i.d. samples from the training data set and calculate the WER from such a sample, or your sample is biased and will lead to incorrect conclusions.
To understand a bit more about this maybe you can read a bit about sampling bias and in particular selection bias.
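In code terms, the suggestion is roughly to draw the evaluation sample uniformly from the original, unweighted file list - a sketch, with compute_wer as a hypothetical callback:

```python
import random

def unbiased_training_wer(txt_files, sample_size, compute_wer):
    # Draw an i.i.d. (uniform, with replacement) sample from the original,
    # unweighted training list, so the reported training WER isn't biased by
    # the duplication used during training.
    sample = random.choices(txt_files, k=sample_size)
    return compute_wer(sample)
```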
I kind of meant that in your example, if we do what you suggest, we would never over-fit - does what you're saying still hold? I do understand what sampling and selection bias are :) I thought the idea of weighting (in this case) was indeed to add some small bias.
Right, in my example we would never overfit. But we would never know this if we used your suggested weighted WER calculation, as the WER calculated on the weighted training data would be 0.0.
The idea of weighting is to add bias in training not in the evaluation of the WER on the training set.
Also, other than "it seems like a good idea", you have to justify why you weight some samples more than others. As the space of weighting functions is infinite dimensional, guessing functions without a theoretical basis isn't really an efficient way of proceeding.
If you were to create some algorithm, like AdaBoost, that weighted "failed" examples more, that would likely have a much higher chance of working.
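A minimal sketch of that kind of error-driven re-weighting - inspired by AdaBoost's data re-weighting step rather than a faithful AdaBoost implementation, with per_sample_error as a hypothetical callback returning each sample's error (e.g. its WER) from the last epoch:

```python
import math

def reweight_failed_samples(weights, per_sample_error, alpha=1.0):
    # Boost the weight of samples the model currently gets wrong:
    # weight_i *= exp(alpha * error_i), then rescale so the total weight
    # (and hence the effective epoch size) stays the same.
    boosted = [w * math.exp(alpha * per_sample_error(i))
               for i, w in enumerate(weights)]
    scale = sum(weights) / sum(boosted)
    return [w * scale for w in boosted]
```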
Agreed about the "it seems like a good idea" justification :) I'm having a look at doing something along the lines of AdaBoost, though of course we'll still need to verify its efficacy.
It may be advantageous to weight training data so that shorter samples get seen more frequently. This may allow us to progress faster during the early stages of training.
I think the easiest way of doing this would be to alter the list of text files in the importers so that the different weights are represented by how often you see samples in the dataset. I'm uncertain of what the best distribution curve would be, so I suggest that we allow defining a distribution function (with the default being a linear distribution).
Let's assume txt_files is the list of text files (from which the wave files are derived), and alpha_cb is a callback that takes a float representing an iterator's progress through txt_files and returns a float that represents the distribution at that point in the graph (so at the short end, where the input would be 0, the output may be 2, meaning that a sample should appear twice in the list, and at the long end, where the input would be 1, the output may be 1, meaning the sample should appear once in the list). Some pseudo-code:
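Something along these lines, for example - a sketch only; expand_by_weight is an illustrative name, the list is assumed to be sorted shortest-first, and rounding fractional weights is just one possible choice:

```python
def expand_by_weight(txt_files, alpha_cb):
    # Build a new file list in which each entry is repeated according to
    # alpha_cb(progress), where progress is the entry's position in [0, 1]
    # through the (shortest-first) list. Fractional weights are rounded here;
    # they could instead be handled probabilistically.
    weighted = []
    n = len(txt_files)
    for i, txt_file in enumerate(txt_files):
        progress = i / max(n - 1, 1)
        repeats = max(int(round(alpha_cb(progress))), 1)
        weighted.extend([txt_file] * repeats)
    return weighted

# e.g. with a linear default (twice at the short end, once at the long end):
#   weighted_files = expand_by_weight(txt_files, lambda p: 2.0 - p)
```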