karino2 opened 5 years ago
Initial model. Data is not yet reduced by the Ramer–Douglas–Peucker algorithm. Masking is just a mean pooling that properly ignores the padded part of the sequence. This model basically has too many parameters (more than the training set size, I guess).
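As a rough sketch of what that masked mean could look like in TensorFlow (tensor names and shapes are my assumptions, not the notebook's actual code):

```python
import tensorflow as tf

def masked_mean(seq, mask):
    """Mean over time that ignores padded steps.

    seq:  [batch, time, dim] float tensor
    mask: [batch, time] float tensor, 1.0 for real steps, 0.0 for padding
    """
    mask = tf.expand_dims(mask, -1)                       # [batch, time, 1]
    total = tf.reduce_sum(seq * mask, axis=1)             # sum real steps only
    count = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)  # avoid div by zero
    return total / count
```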
Reduce the embedding and hidden-layer sizes from model_vec_mask3. Similar tendency, with a bad score.
Just reduce the embedding size from model_vec_mask3. Similar tendency to model_vec_mask3, with a slightly worse score. The accuracy score is very fragile.
Reduce the data with the Ramer–Douglas–Peucker algorithm, epsilon 0.05. The model is the same as model_vec_mask3.
From here, we use this dataset.
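For reference, the reduction itself can be done with the `rdp` package (a sketch; the actual preprocessing code may differ):

```python
import numpy as np
from rdp import rdp  # pip install rdp

# One stroke as a sequence of (x, y) points (hypothetical values).
stroke = np.array([[0.0, 0.0], [0.1, 0.01], [0.2, 0.0], [1.0, 1.0]])

# Drop points that deviate less than epsilon from the simplified polyline.
reduced = rdp(stroke, epsilon=0.05)
```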
This gives the best acc so far, 0.39, but it looks like just a lucky random pick.
Small-parameter trial for the RDP-reduced dataset. More stable, but a bad score.
Just reduce the embedding size from rdp_mask3. Accuracy is fragile, but the score is similar (a little worse than mask3).
TCN for the encoder, with average pooling. depth=6. acc 0.335.
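A minimal sketch of what a depth-6 dilated Conv1D (TCN-style) encoder with average pooling could look like in Keras (filter count and kernel size are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_encoder(x, filters=64, depth=6, kernel_size=3):
    """Dilated causal Conv1D stack, pooled to a single vector.

    x: [batch, time, features] Keras tensor. Sizes are assumptions.
    """
    for i in range(depth):
        x = layers.Conv1D(filters, kernel_size,
                          padding="causal",
                          dilation_rate=2 ** i,  # receptive field doubles per layer
                          activation="relu")(x)
    return layers.GlobalAveragePooling1D()(x)  # average over time
```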
Add dropout in the TCN (not in the decoder, though). Raise the learning rate to 0.001 instead of 0.00009. The learning rate seems too high.
Lower the learning rate from the above trial.
Dropout seems to regularize learning a little, but the final score is not improved.
Add dropout to rdp_mask3. We call the previous conv1d+RNN encoder model cnnrnn2 from now on.
acc 0.3228. Accuracy goes down with dropout. I don't know the reason, but this model's score is fragile, and the previous trial might just have randomly hit a good value.
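For reference, a minimal sketch of a conv1d+RNN encoder in the cnnrnn2 spirit (layer sizes and the GRU choice are assumptions, not the actual model):

```python
from tensorflow.keras import layers

def cnnrnn2_encoder(x, filters=64, units=128, dropout_rate=0.1):
    """Conv1D front-end followed by an RNN. Sizes are assumptions."""
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)  # the dropout added in this trial
    return layers.GRU(units)(x)          # final state summarizes the strokes
```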
From tcn_dropout, add dropout on the decoder side, apply an FC layer to the last output of the TCN, and plug it into the decoder's init state. acc 0.344.
Training becomes stable, but the score is the same.
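The hookup described above might look roughly like this (a sketch; the GRU choice and all sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

tcn_out = layers.Input(shape=(None, 64))  # hypothetical TCN output
dec_emb = layers.Input(shape=(None, 32))  # hypothetical decoder embeddings

# FC on the last TCN timestep becomes the decoder RNN's initial state.
last = layers.Lambda(lambda t: t[:, -1, :])(tcn_out)
init_state = layers.Dense(128, activation="tanh")(last)
dec_out = layers.GRU(128, return_sequences=True, dropout=0.1)(
    dec_emb, initial_state=init_state)
```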
Increase the number of dilated layers to 8 to cover a wider range for the encoder init state. acc 0.347. A little better, but almost the same.
Add L2 regularization of 0.1 to the weights, kernel, and activations of the previous model. acc = 0.34. The same score.
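In Keras terms, attaching an L2 penalty of 0.1 to the kernel, bias, and activations of a layer looks like this (a sketch; which layers it was actually applied to is not shown here):

```python
from tensorflow.keras import layers, regularizers

reg = regularizers.l2(0.1)

# The same penalty attached to kernel, bias, and layer activations.
conv = layers.Conv1D(64, 3,
                     kernel_regularizer=reg,
                     bias_regularizer=reg,
                     activity_regularizer=reg)
```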
I realized that the tfrecord parser had a bug and strokes were wrongly concatenated.
I had tried to reshape to
[[x1, y1, type1], [x2, y2, type2], ..., [xn, yn, typen]]
but the reality was
[[x1, x2, x3], [x4, x5, x6], ..., [type(n-2), type(n-1), typen]]
so it must have been hard to learn the x, y, type relations. All previous experiments were conducted with this bug.
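In other words, the flat record is apparently the x, y, and type arrays concatenated, so a plain (n, 3) reshape interleaves the values wrongly. A minimal numpy illustration of the bug and the likely fix (values are hypothetical; this is not the actual parser code):

```python
import numpy as np

n = 4
xs = np.array([1, 2, 3, 4])             # x1..xn
ys = np.array([10, 20, 30, 40])         # y1..yn
types = np.array([0, 0, 1, 1])          # type1..typen
flat = np.concatenate([xs, ys, types])  # how the record apparently stores it

# Buggy parse: rows mix values from the same array.
bad = flat.reshape(n, 3)     # [[1, 2, 3], [4, 10, 20], ...]

# Fixed parse: reshape to (3, n) and transpose to get (x, y, type) rows.
good = flat.reshape(3, n).T  # [[1, 10, 0], [2, 20, 0], ...]
```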
Fix the parser, with the same model as above. acc=0.335. Almost the same.
The learning curve changed, so the fix seems to be applied. Hard to believe the final score is almost the same.
cnnrnn2 with the same weight regularization as the above model, plus the parser fix. acc = 0.407.
This seems to be the best model, and it can take the stroke data into account. It fits the training set almost perfectly, though it's severely overfitted (final training loss is 0.2179).
Reduce the parameter size of the above model. acc=0.379.
Less overfitting, and the score is stable, but worse.
Increase the dropout rate from 0.1 to 0.5 in the above model. acc=0.377.
Overfitting is not improved.
Increase the dropout rate to 0.9 in the above model (cnnrnn2-small). acc=0.355. Learning becomes more gradual, but converges to the same score. Overfitting is not fixed by dropout.
This cell is moved to #3.
From the discussion, I guess it might be better to compress each stroke into a more compact representation. So I create a feature extractor from the single-symbol subtask dataset.
https://github.com/karino2/tegashiki/blob/master/tegashiki_symbol.ipynb
Basic design note:
acc: 0.36
No Improvement.
I couldn't believe the previous result. It must be a bug!
The previous dataset had different stroke counts for the training set and the validation set. I guess this might cause some mismatch in the loss calculation (I wonder if the attention code might not handle the stroke mask correctly).
acc 0.34
No improvement. It seems it was not a bug.
No matter what I try, the accuracy tends to converge to around 0.3. The training error goes much lower, but the validation loss doesn't improve.
Why? One guess is that the decoder combination is too complex and there is not much data, so the model can't learn the structure. The best it can do might be to just predict the prior distribution of math expression tokens.
To confirm my hypothesis, I create a model that just ignores the input strokes.
Just ignore the input strokes and train only on the decoder input-output. This is a form of math expression language model. It is the dumbest possible model.
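A minimal sketch of such a stroke-ignoring, decoder-only model (vocabulary and layer sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = 200  # hypothetical token vocabulary size

# Strokes are ignored entirely: predict each next token from previous tokens.
tokens_in = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(input_dim=VOCAB, output_dim=64)(tokens_in)
x = layers.GRU(128, return_sequences=True)(x)
next_token = layers.Dense(VOCAB, activation="softmax")(x)
lm = tf.keras.Model(tokens_in, next_token)
```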
acc 0.38
Getting better!
All attempts so far seem to have effectively ignored the stroke input and just predicted the general symbol distribution of math expressions!
The previous experiment suggests the current dataset might be too complex. So I filter out samples whose tex symbol length is too large and keep only those with fewer than 10. The dataset shrinks to only 3000 samples. This seems too small to learn from, but I try anyway.
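The filtering step is conceptually just this (the sample structure and key name are hypothetical):

```python
def keep_short(samples, max_len=10):
    """Keep samples whose tex token sequence is shorter than max_len.

    `samples` is assumed to be a list of dicts with a 'tex' token list.
    """
    return [s for s in samples if len(s["tex"]) < max_len]
```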
acc 0.31
No Improvement. It seems this task is impossible to solve!
Go to #3
Notes on each model.