kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Issues in reproducibility of results in Kaldi #717

Closed: vijayaditya closed this issue 4 years ago

vijayaditya commented 8 years ago

There are several reasons for noise in results across different runs. It would be good if we were able to control the randomness in the experiment with a single random seed. If this is not possible, we should at least list the possible reasons for noise in the results.

[This is part of a wishlist from a recent discussion with a corporate partner]

danpovey commented 8 years ago

As far as I'm aware, if you run with the same number of jobs and on the same hardware and OS version, things should always be deterministic until you get to neural net training. C++ rand() is deterministic by default; Perl is not, but we set the random seed in the programs that use random numbers, to make it consistent.
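
As a minimal standalone illustration of that point about rand() (plain C++, not Kaldi code):

    #include <cstdio>
    #include <cstdlib>

    int main() {
      // rand() without an explicit srand() call behaves as if srand(1) had been
      // called, so the same binary prints the same sequence on every run.
      for (int i = 0; i < 5; i++)
        std::printf("%d\n", std::rand());
      // Reproducibility is typically lost not because rand() varies across runs,
      // but because the consumers of the sequence change, e.g. when the data is
      // split across a different number of jobs.
      return 0;
    }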

Dan

sikoried commented 8 years ago

There are two things that come to mind:

  • Dithering in feature extraction: it depends on the number of jobs, which machines/CPUs they get allocated to, and other factors such as IO load, all of which can result in a different sequence of calls to rand() (see the sketch below). While an explicit --seed=0 would help to some extent, you would still depend on the number of jobs.
  • Feature compression: I still haven't fully understood how this actually works, but it seems to drop regions of the floating-point representation, which will depend on your glibc and your optimization level at compile time. Those can be held constant, but I guess if you chose lossy compression you would be OK with a small variance in reproducibility.
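
To make the first point concrete, here is a rough sketch of what dithering amounts to. It is illustrative only (Kaldi's actual implementation uses Gaussian noise from its own RNG), but it shows why the exact rand() call sequence, and therefore the job layout, changes the features in the last decimal places:

    #include <cstdlib>
    #include <vector>

    // Illustrative dithering: add a tiny amount of random noise to every
    // waveform sample before the rest of the feature pipeline runs.
    // Each sample consumes one value from the global RNG stream, so splitting
    // the utterances across jobs differently reorders the stream and changes
    // the noise (and hence the features) each utterance receives.
    void Dither(std::vector<float> *waveform, float dither_amount = 1.0f) {
      for (float &sample : *waveform) {
        float noise =
            2.0f * std::rand() / static_cast<float>(RAND_MAX) - 1.0f;  // in [-1, 1]
        sample += dither_amount * noise;
      }
    }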

danpovey commented 8 years ago

We could start adding --dither=0.0 --energy-floor=1.0 to the feature extraction configs; that would remove dithering from the equation. The random initial alignments may also make a difference: align-equal also calls rand(). Neural net training will never be deterministic due to nondeterminism in CUBLAS.

Dan
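
For anyone who wants to see what those settings correspond to programmatically, here is a sketch of computing MFCCs with dithering disabled, assuming the standard MfccOptions / FrameExtractionOptions fields; treat it as illustrative rather than a drop-in recipe:

    #include "feat/feature-mfcc.h"
    #include "matrix/kaldi-matrix.h"

    int main() {
      using namespace kaldi;

      MfccOptions opts;
      opts.frame_opts.dither = 0.0;   // no random dithering of the waveform
      opts.use_energy = true;
      opts.energy_floor = 1.0;        // floor the energy so log(0) cannot occur
                                      // once dithering is switched off

      // Dummy 1-second waveform at the default 16 kHz sample rate.
      Vector<BaseFloat> wave(16000);
      wave.SetRandn();

      Mfcc mfcc(opts);
      Matrix<BaseFloat> feats;
      mfcc.Compute(wave, 1.0 /* vtln_warp */, &feats);
      // With dither = 0.0, repeated runs on identical input produce identical
      // features (up to BLAS-level differences).
      return 0;
    }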

galv commented 8 years ago

For anyone unaware, the nondeterminism in various CUDA libraries is due to an interaction between atomics and floating point. Floating-point addition is not associative, and the order in which processors grab locks is not deterministic; this causes floating-point operations to happen in different orders, giving slightly different results.
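
A small self-contained illustration of the floating-point half of this (nothing CUDA-specific is needed; the same effect appears on a CPU as soon as the summation order changes):

    #include <cstdio>

    int main() {
      // Summing the same three values in two different orders: float cannot
      // represent all the intermediate results exactly, so the grouping
      // changes the answer.
      float a = 1e8f, b = -1e8f, c = 1.0f;
      float left_to_right = (a + b) + c;   // = 0 + 1 = 1
      float right_to_left = a + (b + c);   // b + c rounds back to -1e8, so = 0
      std::printf("%g vs %g\n", left_to_right, right_to_left);
      return 0;
    }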

Memory can also have errors (bit flips), but turning on error-correcting codes (ECC) can mitigate this.

sikoried commented 8 years ago

I see this whole issue with two hats on. As a researcher, I'd love a --strict flag on compilation/binaries/scripts that assures me the results will be 100% reproducible (to the extent that @galv describes). As an engineer building real-world systems, I don't really care about half a percent of improvement/degradation on an eval set, since the real world is so variable anyway.

I don't think that --dither=0 --energy-floor=1 is a particularly good idea; looking back, zero variance (even with flooring) is rarely a good idea.

vince62s commented 8 years ago

I have two questions on this topic:

  • What is the typical variation you have observed in results (absolute) after various neural net trainings?
  • Do we know whether the nondeterminism of the cuBLAS library only affects results after a certain number of significant digits, or is this just impossible to measure?

danpovey commented 8 years ago

Maybe 0.1 to 0.2% stddev, but with BLSTMs it can be more, sometimes like 0.3% or 0.4%, IIRC.

Definitely the CUBLAS nondeterminism will be small; the question is how much it affects the trajectory of the model in the long term.

qbolec commented 7 years ago

Context: in my company, we run regression tests for our models and software, which basically test for bit-by-bit equality of output with expected output, and we found that users running tests on the same machine, with the same compilation of Kaldi, get different lat.1.gz files (after unzipping and converting to text).

I've narrowed the problem down to: steps/decode_fmllr.sh > gmm-est-fmllr-gpost > src/gmmbin/gmm-est-fmllr-gpost.cc > AccumulateFromPosteriors > transform/fmllr-diag-gmm.cc > FmllrDiagGmmAccs::AccumulateFromPosteriors > stats.a.AddMatVec(1.0, pdf.means_invvars(), kTrans, posterior, 1.0); > src/matrix/kaldi-vector.cc > VectorBase<Real>::AddMatVec > cblas_Xgemv(trans, M.NumRows(), M.NumCols(), alpha, M.Data(), M.Stride(), v.Data(), 1, beta, data_, 1); > matrix/cblas-wrappers.h > inline void cblas_Xgemv > cblas_sgemv(CblasRowMajor, static_cast<CBLAS_TRANSPOSE>(trans), num_rows, num_cols, alpha, Mdata, stride, xdata, incX, beta, ydata, incY)

We use ATLAS 3.8.3 and this function seems to be nondeterministic. I'm not sure why, but I can demonstrate that it gives different results depending on such "unrelated" things as whether the input to gmm-est-fmllr-gpost was fed by cat input | ... or ... < input. I really hope this is not a buffer overrun issue; my bet is on timing issues combined with the non-associativity of floating-point arithmetic.

I've patched AddMatVec with a naive implementation:

    // Naive fallback for the transposed, alpha == 1, beta == 1 case:
    // accumulate y += M^T * v in a fixed loop order, so the result does not
    // depend on how the BLAS implementation schedules the work.
    if (trans == kTrans && alpha == 1 && beta == 1) {
      const Real *v_Data = v.Data();
      const Real *M_Data = M.Data();
      MatrixIndexT M_Stride = M.Stride();
      for (MatrixIndexT j = 0; j < v.dim_; ++j) {    // rows of M
        for (MatrixIndexT i = 0; i < dim_; ++i) {    // columns of M == Dim() of this
          data_[i] += v_Data[j] * M_Data[j * M_Stride + i];
        }
      }
      return;
    }

and that seems to make gmm-est-fmllr-gpost results deterministic, similar (to at least the 6th decimal place) to the original code, and only slightly slower (1.5 s vs. 1.2 s).
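
For reference, here is a minimal sketch of how one could check this kind of run-to-run behaviour of AddMatVec in isolation, using Kaldi's matrix library directly; the sizes are arbitrary, and a single-threaded BLAS may well pass this within one process even when full pipeline runs differ:

    #include <cstdio>
    #include "matrix/kaldi-matrix.h"
    #include "matrix/kaldi-vector.h"

    int main() {
      using namespace kaldi;

      // Same M, v, and (zero-initialized) destination vectors for both calls.
      Matrix<BaseFloat> M(1000, 400);
      M.SetRandn();
      Vector<BaseFloat> v(1000);
      v.SetRandn();
      Vector<BaseFloat> y1(400), y2(400);

      // y += 1.0 * M^T * v, computed twice through whatever BLAS is linked in.
      y1.AddMatVec(1.0, M, kTrans, v, 1.0);
      y2.AddMatVec(1.0, M, kTrans, v, 1.0);

      // Bit-for-bit comparison. A multi-threaded BLAS can fail this even
      // though the two results agree to many decimal places.
      bool identical = true;
      for (MatrixIndexT i = 0; i < y1.Dim(); i++)
        if (y1(i) != y2(i)) identical = false;
      std::printf("bit-for-bit identical: %s\n", identical ? "yes" : "no");
      return 0;
    }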

I'd love to know:

  1. Is the above code snippet correct?
  2. Is this some known issue with ATLAS 3.8.3?
  3. Can I force ATLAS to be deterministic?

danpovey commented 7 years ago

If your code were wrong, the automatic tests ('make test') would have failed big time. We generally don't expect bit-for-bit equality (e.g. we run on heterogeneous hardware), so we don't keep track of this. Certainly if you use multi-threaded ATLAS it would be nondeterministic. If single-threaded... not sure. But again, we don't check for this.

Dan

sdeena commented 5 years ago

Why are the results in Kaldi split-dependent? Theoretically, this should not be the case. I see small (statistically insignificant) variations in WER when changing the number of splits in training and test.

danpovey commented 5 years ago

Probably it's to do with dithering in the MFCC computation; you can do some Google searches to find threads about this.

sdeena commented 5 years ago

But the results are repeatable when using the same number of splits across runs; they only start diverging when the number of splits changes, even while keeping the splits in feature extraction the same. So something else must be going on there?

danpovey commented 5 years ago

Without knowing what decoding script you are using, I couldn't say.

sdeena commented 5 years ago

I tried monophone training and decoding on the Librispeech recipe. When using 30 splits for training and decoding, I get the following result:

%WER 44.48 [ 23386 / 52576, 860 ins, 6496 del, 16030 sub ]
%SER 95.61 [ 2505 / 2620 ]

When using 50 splits, I get:

%WER 44.40 [ 23342 / 52576, 850 ins, 6523 del, 15969 sub ]
%SER 95.69 [ 2507 / 2620 ]

Despite the difference being small, I notice that it accumulates over further stages such as triphone, LDA+MLLT, and SAT, and the difference becomes quite substantial. Using the same number of splits gives the same result. As far as I understand, the results should not be split-dependent. Is this a bug?

danpovey commented 5 years ago

I think the flat-start alignment calls rand() at some point; that would affect it. In any case, I don't think it's a good idea to rely on that kind of complete repeatability, since the kinds of variations you'll get from changing the number of splits will be below the level of noise you'd expect from a genuine experimental change, so you need to learn not to read too much into very slight variations in results.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.