localminimum / QANet

A Tensorflow implementation of QANet for machine reading comprehension
MIT License

Optimize the memory usage of the trilinear function; add early stopping for training #16

Closed jasonwbw closed 6 years ago

jasonwbw commented 6 years ago

The main idea behind the optimization is that the attention function "[C, Q, C * Q] dot W" can be split into "C dot W1 + Q dot W2 + (C * Q) dot W3", where W1, W2, and W3 are the three slices of W that line up with the three concatenated parts. Given that, we can perform the dot products before the expand_dims and tile, so that the last dimension of the expanded tensors is reduced from HiddenSize to 1 (since the last dimension of W is 1). Btw, I think the current inputs of the trilinear function consume a lot of memory, although the multi-head self-attention costs even more.
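To make the split concrete, here is a minimal NumPy sketch (not the code in this PR). The function names, the shape convention (C is [batch, c_len, hidden], Q is [batch, q_len, hidden], W is [3 * hidden, 1]), and the use of NumPy instead of TensorFlow are all assumptions for illustration. It compares the tiled version with the split version, which never builds a [batch, c_len, q_len, hidden] tensor.

```python
import numpy as np


def trilinear_naive(C, Q, W):
    """Tile C and Q to [B, Lc, Lq, H], concatenate, then project with W ([3H, 1])."""
    B, Lc, H = C.shape
    Lq = Q.shape[1]
    C_t = np.broadcast_to(C[:, :, None, :], (B, Lc, Lq, H))
    Q_t = np.broadcast_to(Q[:, None, :, :], (B, Lc, Lq, H))
    # The concatenation materializes a [B, Lc, Lq, 3H] tensor: the memory hotspot.
    feats = np.concatenate([C_t, Q_t, C_t * Q_t], axis=-1)
    return (feats @ W)[..., 0]  # [B, Lc, Lq]


def trilinear_split(C, Q, W):
    """Same scores, computed as C.W1 + Q.W2 + (C*Q).W3 without tiling to hidden size."""
    H = C.shape[-1]
    W1, W2, W3 = W[:H], W[H:2 * H], W[2 * H:]  # each slice is [H, 1]
    c_term = C @ W1                            # [B, Lc, 1]
    q_term = np.swapaxes(Q @ W2, 1, 2)         # [B, 1, Lq]
    # sum_h C[i, h] * W3[h] * Q[j, h] equals ((C * Q) dot W3)[i, j]
    cq_term = (C * W3[:, 0]) @ np.swapaxes(Q, 1, 2)  # [B, Lc, Lq]
    return c_term + q_term + cq_term           # broadcasts to [B, Lc, Lq]


# Hypothetical small shapes, just to compare the two versions.
rng = np.random.default_rng(0)
C = rng.standard_normal((2, 5, 8))
Q = rng.standard_normal((2, 3, 8))
W = rng.standard_normal((24, 1))
same = np.allclose(trilinear_naive(C, Q, W), trilinear_split(C, Q, W))
```

In the split version, the largest intermediate is the [batch, c_len, q_len] score matrix itself, so peak memory for this step no longer scales with the hidden size.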