In particular, the following changes should be of note:
- Every `nn.Linear` instantiation is now replaced by the `get_linear` function, which returns an `nn.Linear` with Xavier initialization. This also affects how the `E` and `F` matrices are initialized (a sketch of such a helper follows below).
- There are no more `w_q`, `w_k`, and `w_v` matrices in the `LinearAttentionHead` module. Instead, in the `MHAttention` module, `to_{q,k,v}` is now a `ModuleList`, and each one holds `nhead` `nn.Linear` layers, one per head, corresponding to the weight matrices in the original paper (see the second sketch below).
- Fixed a bug where there were still `**kwargs` in the `"C2"` checkpoint function (see the note below on why this matters for checkpointing).
Changed some other things in the code, as discussed here: https://github.com/tatp22/linformer-pytorch/issues/6
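A minimal sketch of what such a `get_linear` helper might look like; the exact signature in the repository may differ, but the idea is an `nn.Linear` whose weight is Xavier-initialized (and whose bias, if present, is zeroed):

```python
import torch.nn as nn

def get_linear(in_dim, out_dim, bias=True):
    """Sketch only: build an nn.Linear with a Xavier-initialized weight."""
    linear = nn.Linear(in_dim, out_dim, bias=bias)
    nn.init.xavier_normal_(linear.weight)
    if bias:
        nn.init.constant_(linear.bias, 0.0)
    return linear

# The E and F low-rank projections from the Linformer paper can then be
# built the same way; seq_len and k are assumed dimension names here:
# E = get_linear(seq_len, k, bias=False)
```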
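Similarly, a sketch of how `to_{q,k,v}` could be organized in `MHAttention` as `ModuleList`s of per-head projections, using the `get_linear` helper sketched above. The constructor arguments (`dim`, `dim_head`, `nhead`) are illustrative, not the repository's exact signature:

```python
import torch.nn as nn

class MHAttention(nn.Module):
    """Sketch only: one nn.Linear per head for each of Q, K, and V."""
    def __init__(self, dim, dim_head, nhead):
        super().__init__()
        # Each ModuleList holds nhead projections, replacing the single
        # w_q / w_k / w_v matrices that lived inside LinearAttentionHead.
        self.to_q = nn.ModuleList([get_linear(dim, dim_head) for _ in range(nhead)])
        self.to_k = nn.ModuleList([get_linear(dim, dim_head) for _ in range(nhead)])
        self.to_v = nn.ModuleList([get_linear(dim, dim_head) for _ in range(nhead)])
```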
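As for the `**kwargs` bug: at the time, `torch.utils.checkpoint.checkpoint` forwarded only positional arguments to the wrapped function, so keyword arguments left in a checkpointed callable were never passed through. A hedged illustration of the general pitfall (not the repository's actual `"C2"` code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def layer(x, scale=1.0):
    return x * scale

x = torch.randn(2, 4, requires_grad=True)

# Pitfall: keyword arguments do not survive the checkpoint boundary here.
# out = checkpoint(layer, x, scale=0.5)  # rejected by older PyTorch versions

# Workaround: bind keyword arguments before checkpointing.
out = checkpoint(lambda t: layer(t, scale=0.5), x)
```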