Optimize the trilinear function for lower memory cost; add early stopping

Reduce the memory cost of the trilinear function from n * BatchSize * C_len * Q_len * HiddenSize to n * BatchSize * C_len * Q_len, where n is an integer. That is about n * 0.23G -> n * 0.002G for HiddenSize 96, and about n * 0.31G -> n * 0.002G for HiddenSize 128.
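
For reviewers, here is a rough NumPy sketch of the idea, not the code in this PR. It assumes the usual trilinear form f(c, q) = w_c·c + w_q·q + w_m·(c ⊙ q); the function names, the weight names `w_c`, `w_q`, `w_m`, and the sizes (batch 32, C_len 400, Q_len 50, float32) are my own assumptions, picked because they roughly reproduce the numbers quoted above.

```python
import math
import numpy as np

def trilinear_naive(c, q, w_c, w_q, w_m):
    """Reference version: builds [B, C_len, Q_len, H]-sized intermediates."""
    B, C_len, H = c.shape
    Q_len = q.shape[1]
    c_t = np.broadcast_to(c[:, :, None, :], (B, C_len, Q_len, H))
    q_t = np.broadcast_to(q[:, None, :, :], (B, C_len, Q_len, H))
    feats = np.concatenate([c_t, q_t, c_t * q_t], axis=-1)  # [B, C_len, Q_len, 3H]
    return feats @ np.concatenate([w_c, w_q, w_m])          # [B, C_len, Q_len]

def trilinear_lowmem(c, q, w_c, w_q, w_m):
    """Same result, but no intermediate is larger than [B, C_len, Q_len]."""
    c_part = (c @ w_c)[:, :, None]              # [B, C_len, 1]
    q_part = (q @ w_q)[:, None, :]              # [B, 1, Q_len]
    cq_part = (c * w_m) @ q.transpose(0, 2, 1)  # [B, C_len, Q_len]
    return c_part + q_part + cq_part            # broadcasts to [B, C_len, Q_len]

def tensor_gib(*dims, bytes_per_elem=4):
    """Size of a dense tensor in GiB (float32 by default)."""
    return math.prod(dims) * bytes_per_elem / 2**30

# Assumed sizes, for illustration only.
print(round(tensor_gib(32, 400, 50, 96), 3))   # with the HiddenSize axis
print(round(tensor_gib(32, 400, 50), 4))       # without it
```

The saving comes from never tiling the context and query to the full 4-D shape; the elementwise-product term becomes a batched matmul.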
Add early stopping. The current pipeline only saves the last five models, so it is very easy to overfit past the best checkpoint, and then we can't get the actual EM and F1 on the dev set (the logged scores are computed with the max length set to 400, so they are lower than the actual numbers).
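
To make the early-stopping behaviour concrete, here is a minimal patience-based sketch. It is only an illustration: the `EarlyStopper` class, the `patience` value, and the use of dev F1 as the tracked metric are assumptions of mine, not necessarily what the code in this PR does.

```python
class EarlyStopper:
    """Stop when the dev score has not improved for `patience` evaluations."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, score):
        """Record a dev score; return True when training should stop."""
        if score > self.best_score:
            # New best: remember it (this is where a checkpoint would be saved).
            self.best_score, self.best_step = score, step
            self.bad_evals = 0
            return False
        self.bad_evals += 1
        return self.bad_evals >= self.patience


def run(dev_scores, patience=3):
    """Feed a sequence of dev scores; return (stop_step, best_step, best_score)."""
    stopper = EarlyStopper(patience)
    for step, score in enumerate(dev_scores):
        if stopper.update(step, score):
            return step, stopper.best_step, stopper.best_score
    return None, stopper.best_step, stopper.best_score


# Made-up dev F1 values, for illustration only.
print(run([60.0, 65.0, 70.0, 69.0, 68.0, 67.0, 71.0]))
```

Keeping the best step around means the best model can still be restored even if training runs a few evaluations past it.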

Btw, I'm sorry for bringing in lots of commits when I pulled from upstream. It may be caused by a bug in the GitHub Windows GUI client I used.

localminimum / QANet