karino2 opened 5 years ago
Model is the same as padstroke_small_rnn_small_dropout05 in #1 . Feature extractor with GRU encoder-decoder-attention model.
acc 0.956
Impressive score. It seems generating data from symbol data is a better strategy for the current stage.
It's not clear why the score is better than single-symbol prediction. But this dataset contains only alphabet characters (including common math symbols) and numbers, so it might be easier to distinguish than the single-symbol task.
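The core of the GRU encoder-decoder-attention model above is the attention step between the decoder state and the encoder's stroke features. A minimal numpy sketch of plain dot-product attention (hypothetical shapes; the real model's dimensions are not stated here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(dec_state, enc_outputs):
    """dec_state: (hidden,), enc_outputs: (src_len, hidden).
    Returns a context vector (hidden,) and attention weights (src_len,)."""
    scores = enc_outputs @ dec_state   # similarity of each encoder step to the decoder state
    weights = softmax(scores)          # normalize to a distribution over encoder steps
    context = weights @ enc_outputs    # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
enc_outputs = rng.normal(size=(7, 16))  # 7 stroke-feature steps, hidden size 16 (assumed)
dec_state = rng.normal(size=16)
context, weights = dot_product_attention(dec_state, enc_outputs)
```

The context vector is then combined with the decoder state to predict the next symbol.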
The above model is very nice, but it was hard to convert to TensorFlow Lite because RNN support is still at an experimental stage (the dynamic_rnn function generates a graph that is hard to train on TPU: dynamic shapes even though I supply all shapes).
So I explored a CNN-based encoder-decoder model instead.
The feature extractor creates a list of stroke features. The decoder applies conv1d to the teacher-forced input, then attention between this output and the stroke features. Absolute position is added to the decoder input embedding.
The conv1d filter size is 3 and the kernel size is 8.
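For the decoder, the conv1d over the teacher-forced input must be causal (output at step t must not see inputs after t), which is usually done with left padding. A minimal numpy sketch of that idea (hypothetical dimensions, not the model's actual code):

```python
import numpy as np

def causal_conv1d(x, w):
    """x: (T, in_ch), w: (kernel, in_ch, out_ch).
    Left-pads with zeros so the output at step t only sees inputs <= t."""
    k = w.shape[0]
    xp = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    T, out_ch = x.shape[0], w.shape[2]
    y = np.zeros((T, out_ch))
    for t in range(T):
        window = xp[t:t + k]                   # (kernel, in_ch), ends at step t
        y[t] = np.einsum('ki,kio->o', window, w)
    return y

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 8))  # 10 teacher-forced steps, embedding dim 8 (assumed)
w = rng.normal(size=(3, 8, 8))     # kernel size 3, 8 filters (assumed)
out = causal_conv1d(tokens, w)
```

Without the left padding, the convolution would leak future target symbols into the prediction at each step.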
acc: 0.2
Out of the question...
In the experiment above, the training loss itself does not decrease enough. So I just added more parameters to the conv1d: kernel size 5, filter size 128.
acc: 0.237
The score is still out of the question (though the graph shape is much nicer...).
Add absolute position to the stroke features too.
acc 0.275
Out of the question.
Add conv1d to the encoder side too.
acc: 0.233
No improvement.
Add an embedding layer to the absolute position before adding it up.
acc 0.86
Yes! Getting better! I misread the ConvS2S paper: we need an embedding layer.
The score is lower than the GRU-based model, but it is worth pursuing. Let's convert to TF Lite.
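The fix described above, in numpy form: pass the absolute positions through a learned embedding table before adding them to the token embedding, ConvS2S-style. A minimal sketch with hypothetical vocabulary and dimension sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, dim = 100, 20, 8  # assumed sizes, not the model's actual ones

W_tok = rng.uniform(-0.05, 0.05, size=(vocab, dim))    # token embedding table
W_pos = rng.uniform(-0.05, 0.05, size=(max_len, dim))  # learned position embedding table

token_ids = np.array([5, 42, 7, 99])
positions = np.arange(len(token_ids))

# ConvS2S-style decoder input: token embedding + learned position embedding
dec_input = W_tok[token_ids] + W_pos[positions]
```

Adding the raw position scalar directly (without the embedding lookup) puts an unscaled integer onto a small-valued embedding, which plausibly explains the earlier failures.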
An Embedding layer applied to a generated tensor (not a model input) causes the TOCO converter to fail (positional encoding needs exactly this).
It seems tf.gather on a dynamically generated tensor causes the TOCO converter failure (?).
So I created my own embedding layer: build a one-hot vector and matmul it with a weight matrix initialized uniformly in [-1, 1].
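The one-hot-plus-matmul trick works because it computes exactly the same result as an embedding lookup (a gather), just without the tf.gather op. A minimal numpy sketch of the equivalence, using the [-1, 1] uniform initialization described above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 8  # assumed sizes

# custom embedding weight, uniform in [-1, 1] as described above
W = rng.uniform(-1.0, 1.0, size=(vocab, dim))

ids = np.array([3, 0, 49, 3])
one_hot = np.eye(vocab)[ids]   # (4, vocab): one row per id
embedded = one_hot @ W         # (4, dim): identical to the gather W[ids]
```

The cost is an extra (batch, vocab) x (vocab, dim) matmul per lookup, which is acceptable for a small vocabulary.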
acc: 0.73
The score got worse, but it still works and can be converted to a TF Lite model.
The Keras Embedding layer seems to initialize its weight matrix with a uniform distribution over [-0.05, 0.05]. I set up the same initialization.
acc: 0.84
Now the score is almost identical to the Keras Embedding layer (though regularization seems a little different).
Anyway, we finally have a working model that can be converted to TF Lite!
The previous model had a bug: future input information leaked via layer normalization. So I just dropped this layer and retrained.
acc: 0.75
Worse, but better than random, so this is enough to check on a real device.
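One plausible mechanism for such a leak (my assumption about the bug, not confirmed from the code): if the normalization statistics are computed over the time axis rather than per timestep, every output step depends on future inputs. A numpy demonstration:

```python
import numpy as np

def norm_over_time(x):
    """Buggy variant: statistics over the whole (time, feature) block,
    so every output step depends on every input step, including future ones."""
    return (x - x.mean()) / (x.std() + 1e-6)

def norm_per_step(x):
    """Proper layer norm: statistics per timestep only, no leakage across time."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / (sd + 1e-6)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
x2 = x.copy()
x2[9] += 1.0  # perturb only the last (future) timestep

leak = norm_over_time(x)[0] - norm_over_time(x2)[0]  # nonzero: step 0 saw the future
safe = norm_per_step(x)[0] - norm_per_step(x2)[0]    # zero: step 0 is unaffected
```

Per-timestep layer norm would have been safe to keep; dropping the layer entirely also removes the leak, at some cost in accuracy.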
The original dataset seems too difficult (too small for its complexity). So I generated a far easier dataset from the subtask symbol training dataset. The validation set is built from the subtask validation dataset with almost the same ratio.
https://github.com/karino2/tegashiki/blob/master/tegashiki_mathexp_generate.ipynb