Orthogonal matrix initialization

I'd like this feature added; it might improve results across the board. Please do this in the kaldi_52 branch. @keli78, I'd like you to do this (get help from someone) to improve your C++/nnet3 familiarity.

See this blog post http://hjweide.github.io/orthogonal-initialization-in-convolutional-layers for some background. This is applicable to affine and convolutional layers.

First, please add the following function to nnet-utils.h. This location isn't ideal (I couldn't decide where to put it), and I may later move it to kaldi-matrix.h or the like:

/// This function is used in orthogonal initialization of parameters in neural
/// networks, as recommended in the paper "Exact solutions to the nonlinear
///  dynamics of learning in deep linear neural networks"
/// https://arxiv.org/pdf/1312.6120.pdf.
/// If mat->NumRows() is <= mat->NumCols(), then this function sets the
/// rows of the matrix to random orthogonal vectors with equal norm, scaled
/// in such a way that the vector 2-norm (Frobenius norm) of the matrix is 
/// distributed the same as if you had called mat->SetRandn().
/// If mat->NumRows() > mat->NumCols(), then replace "rows" in the previous
/// sentence with "columns".
void SetRandOrthogonal(CuMatrix<BaseFloat> *mat);

In the implementation, please do something like this:

void SetRandOrthogonal(CuMatrix<BaseFloat> *mat) {
  if (mat->NumCols() > mat->NumRows()) {
    CuMatrix<BaseFloat> mat_trans(mat->NumCols(), mat->NumRows());
    SetRandOrthogonal(&mat_trans);
    // copy to mat, transposed:
    mat->CopyFromMat(mat_trans, kTrans);
    return;
  }
  // Now assume that num_rows >= num_cols, and we'll make the
  // columns orthogonal.  Doing it this way round (num_rows >= num_cols)
  // saves a temporary matrix inside Svd (note: because LAPACK has a
  // FORTRAN origin, it's very oriented towards operation on columns,
  // which is super-suboptimal in C/C++ from a memory access order perspective,
  // but that project has a lot of inertia).

  // note to Ke: some functions are only available in class Matrix, you will have to create
  // a temporary (non-Cu) Matrix and copy, so don't take the below code
  // literally e.g. when I say mat->SetRandn().

  // call mat->SetRandn();
  BaseFloat old_norm = mat->FrobeniusNorm();
  // Do SVD with mat->DestructiveSvd(), mat = U*diag(S)*Vt.
  // We don't care about Vt, so make it NULL.  U will have the same
  // dim as 'mat', and after SVD it will have orthonormal columns
  // (since num_rows >= num_cols here).
  mat->CopyFromMat(U);
  // Scale 'mat' by a positive scalar chosen so that its frobenius norm is
  // the same as it was before we did the SVD:
  mat->Scale(old_norm / mat->FrobeniusNorm());
}
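
For anyone who wants to check the intended property without building Kaldi, here is a minimal standalone sketch. It is not the proposed implementation: it uses modified Gram-Schmidt instead of SVD, plain std::vector instead of CuMatrix, and only handles the rows <= cols case. The names Mat, Dot, FrobeniusNorm and SetRandOrthogonalRows are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Row-major dense matrix; a hypothetical stand-in for Kaldi's Matrix<BaseFloat>.
using Mat = std::vector<std::vector<double>>;

double Dot(const std::vector<double> &a, const std::vector<double> &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

double FrobeniusNorm(const Mat &m) {
  double s = 0.0;
  for (const auto &row : m) s += Dot(row, row);
  return std::sqrt(s);
}

// Fills 'm' (which must have rows <= cols) with random, mutually orthogonal
// rows of equal norm, scaled so that the Frobenius norm equals that of the
// Gaussian draw we started from.
void SetRandOrthogonalRows(Mat *m, unsigned seed) {
  assert(!m->empty() && m->size() <= (*m)[0].size());
  std::mt19937 rng(seed);
  std::normal_distribution<double> randn(0.0, 1.0);
  for (auto &row : *m)
    for (auto &x : row) x = randn(rng);
  const double old_norm = FrobeniusNorm(*m);

  // Modified Gram-Schmidt (assumption: used here in place of the SVD):
  // remove from each row its components along the earlier rows, which are
  // already unit-norm, then normalize the row.
  for (std::size_t i = 0; i < m->size(); i++) {
    std::vector<double> &row = (*m)[i];
    for (std::size_t j = 0; j < i; j++) {
      const double proj = Dot(row, (*m)[j]);
      for (std::size_t k = 0; k < row.size(); k++) row[k] -= proj * (*m)[j][k];
    }
    const double norm = std::sqrt(Dot(row, row));
    for (auto &x : row) x /= norm;
  }

  // Restore the Frobenius norm of the original Gaussian matrix.
  const double scale = old_norm / FrobeniusNorm(*m);
  for (auto &row : *m)
    for (auto &x : row) x *= scale;
}
```

Calling SetRandOrthogonalRows(&m, seed) on a 3 x 5 matrix should give rows whose pairwise dot products are zero up to rounding and whose norms are all equal. The rows > cols case would be handled by transposing, as in the pseudocode above.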

Next, we'll have to change the way the components initialize their matrices, to optionally use the orthogonal initialization. You'll change class NaturalGradientAffineComponent. There is a comment above its definition in nnet-simple-component.h that describes the config parameters it accepts in the config file. After bias-mean, add the following:

            init-orthogonal             If true, use orthogonal initialization as
                                        recommended in https://arxiv.org/abs/1312.6120.
                                        Defaults to true.

(Defaulting it to true is not back-compatible, but this will probably improve all recipes; we'll do some testing.) You will add an extra argument init_orthogonal to the first Init() function, after bias_mean. In its definition you'll change linear_params_.SetRandn(); to:

  if (init_orthogonal)
    SetRandOrthogonal(&linear_params_);
  else
    linear_params_.SetRandn();

You'll change InitFromConfig() to accept the new value. No need to check the return status of cfl->GetValue() because the parameter is optional.
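
The optional-parameter pattern looks roughly like the sketch below. ToyConfigLine and ReadInitOrthogonal are hypothetical stand-ins written so the snippet compiles on its own; the real code would call GetValue() on the ConfigLine passed to InitFromConfig(), and the toy parsing here (only the string "true" counts as true) is an assumption, not Kaldi's behavior.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parsed config line: key=value pairs as strings.
struct ToyConfigLine {
  std::map<std::string, std::string> values;

  // Returns true and sets *value if 'key' is present; otherwise returns
  // false and leaves *value untouched.
  bool GetValue(const std::string &key, bool *value) const {
    auto it = values.find(key);
    if (it == values.end()) return false;
    *value = (it->second == "true");
    return true;
  }
};

// Reads the optional init-orthogonal setting.  The variable is set to its
// default first, so the return value of GetValue() can be ignored: if the
// key is absent, the default simply stays in place.
bool ReadInitOrthogonal(const ToyConfigLine &cfl) {
  bool init_orthogonal = true;  // default proposed in this issue
  cfl.GetValue("init-orthogonal", &init_orthogonal);
  return init_orthogonal;
}
```

The important part is the order: initialize to the default, then let GetValue() overwrite it only when the key is present.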

Also please make similar changes to class TimeHeightConvolutionComponent in nnet-convolutional-component.{h,cc}. It's a bit simpler there because there is no separate Init() function. Also have the parameter default to true.

You'll have to run some experiments (any setup) to see if it makes any difference to the results.

I may not merge this right away because it requires some testing.

Note: for testing purposes, in TDNN-type and LSTM-type layers you can set init-orthogonal=false in the ng-affine-options to get the old behavior, if needed. ng-affine-options defaults to 'max-change=0.75', so you'd have to add the following to the xconfig line: ng-affine-options="max-change=0.75 init-orthogonal=false". If the double quotes cause a crash in xconfig_to_configs.py, try single quotes; I forget which is supported.
