karino2 opened 5 years ago
Initial model. Data is not yet reduced by the Ramer–Douglas–Peucker algorithm. Masking is just a mean pooling that properly ignores the padded part of the sequence. This model basically has too many parameters (more than the training set size, I guess).
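As a rough sketch of what that masked mean could look like in TensorFlow (tensor names and shapes are my assumptions, not the notebook's actual code):

```python
import tensorflow as tf

def masked_mean(seq, mask):
    """Mean over time that ignores padded steps.

    seq:  [batch, time, dim] float tensor
    mask: [batch, time] float tensor, 1.0 for real steps, 0.0 for padding
    """
    mask = tf.expand_dims(mask, -1)                       # [batch, time, 1]
    total = tf.reduce_sum(seq * mask, axis=1)             # sum real steps only
    count = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)  # avoid div by zero
    return total / count
```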
Reduce the embedding and hidden-layer sizes from model_vec_mask3. Similar tendency, with a bad score.
Just reduce the embedding size from model_vec_mask3. Similar tendency to model_vec_mask3, with a slightly worse score. The accuracy score is very fragile.
Reduce the data with the Ramer–Douglas–Peucker algorithm, epsilon 0.05. The model is the same as model_vec_mask3.
From here, we use this dataset.
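For reference, the reduction itself can be done with the `rdp` package (a sketch; the actual preprocessing code may differ):

```python
import numpy as np
from rdp import rdp  # pip install rdp

# One stroke as a sequence of (x, y) points (hypothetical values).
stroke = np.array([[0.0, 0.0], [0.1, 0.01], [0.2, 0.0], [1.0, 1.0]])

# Drop points that deviate less than epsilon from the simplified polyline.
reduced = rdp(stroke, epsilon=0.05)
```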
This gives the best acc so far, 0.39, but it looks like just a lucky random pick.
Small-parameter trial for the RDP-reduced dataset. More stable, but a bad score.
Just reduce the embedding size from rdp_mask3. Accuracy is fragile, but the score is similar (a little worse than mask3).
TCN for the encoder, with average pooling. depth=6. acc 0.335.
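A minimal sketch of what a depth-6 dilated Conv1D (TCN-style) encoder with average pooling could look like in Keras (filter count and kernel size are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_encoder(x, filters=64, depth=6, kernel_size=3):
    """Dilated causal Conv1D stack, pooled to a single vector.

    x: [batch, time, features] Keras tensor. Sizes are assumptions.
    """
    for i in range(depth):
        x = layers.Conv1D(filters, kernel_size,
                          padding="causal",
                          dilation_rate=2 ** i,  # receptive field doubles per layer
                          activation="relu")(x)
    return layers.GlobalAveragePooling1D()(x)  # average over time
```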
Add dropout in the TCN (not in the decoder, though). Raise the learning rate to 0.001 instead of 0.00009. The learning rate seems too high.
Lower the learning rate from the above trial.
Dropout seems to regularize learning a little, but the final score is not improved.
Add dropout to rdp_mask3. We call the previous conv1d+RNN encoder model cnnrnn2 from now on.
acc 0.3228. Accuracy goes down with dropout. I don't know the reason, but this model's score is fragile, and the previous trial might just have randomly hit a good value.
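For reference, a minimal sketch of a conv1d+RNN encoder in the cnnrnn2 spirit (layer sizes and the GRU choice are assumptions, not the actual model):

```python
from tensorflow.keras import layers

def cnnrnn2_encoder(x, filters=64, units=128, dropout_rate=0.1):
    """Conv1D front-end followed by an RNN. Sizes are assumptions."""
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)  # the dropout added in this trial
    return layers.GRU(units)(x)          # final state summarizes the strokes
```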
From tcn_dropout, add dropout on the decoder side, apply an FC layer to the last output of the TCN, and plug it into the decoder's init state. acc 0.344.
Training becomes stable, but the score is the same.
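The hookup described above might look roughly like this (a sketch; the GRU choice and all sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

tcn_out = layers.Input(shape=(None, 64))  # hypothetical TCN output
dec_emb = layers.Input(shape=(None, 32))  # hypothetical decoder embeddings

# FC on the last TCN timestep becomes the decoder RNN's initial state.
last = layers.Lambda(lambda t: t[:, -1, :])(tcn_out)
init_state = layers.Dense(128, activation="tanh")(last)
dec_out = layers.GRU(128, return_sequences=True, dropout=0.1)(
    dec_emb, initial_state=init_state)
```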
Increase the number of dilated layers to 8 to cover a wider range for the encoder init state. acc 0.347. A little better, but almost the same.
Add L2 regularization of 0.1 to the weights, kernel, and activations of the previous model. acc = 0.34. The same score.
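In Keras terms, attaching an L2 penalty of 0.1 to the kernel, bias, and activations of a layer looks like this (a sketch; which layers it was actually applied to is not shown here):

```python
from tensorflow.keras import layers, regularizers

reg = regularizers.l2(0.1)

# The same penalty attached to kernel, bias, and layer activations.
conv = layers.Conv1D(64, 3,
                     kernel_regularizer=reg,
                     bias_regularizer=reg,
                     activity_regularizer=reg)
```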
I realized that the tfrecord parser had a bug and strokes were wrongly concatenated.
I had tried to reshape to
[[x1, y1, type1], [x2, y2, type2], ..., [xn, yn, typen]]
but the reality was
[[x1, x2, x3], [x4, x5, x6], ..., [type(n-2), type(n-1), typen]]
so it must have been hard to learn the x, y, type relations. All previous experiments were conducted with this bug.
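In other words, the flat record is apparently the x, y, and type arrays concatenated, so a plain (n, 3) reshape interleaves the values wrongly. A minimal numpy illustration of the bug and the likely fix (values are hypothetical; this is not the actual parser code):

```python
import numpy as np

n = 4
xs = np.array([1, 2, 3, 4])             # x1..xn
ys = np.array([10, 20, 30, 40])         # y1..yn
types = np.array([0, 0, 1, 1])          # type1..typen
flat = np.concatenate([xs, ys, types])  # how the record apparently stores it

# Buggy parse: rows mix values from the same array.
bad = flat.reshape(n, 3)     # [[1, 2, 3], [4, 10, 20], ...]

# Fixed parse: reshape to (3, n) and transpose to get (x, y, type) rows.
good = flat.reshape(3, n).T  # [[1, 10, 0], [2, 20, 0], ...]
```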
Fix the parser, with the same model as above. acc=0.335. Almost the same.
The learning curve changed, so the fix seems to be applied. Hard to believe the final score is almost the same.
cnnrnn2 with the same weight regularization as the above model, plus the parser fix. acc = 0.407.
This seems to be the best model, and it can take the stroke data into account. It fits the training set almost perfectly, though it's severely overfitted (final training loss is 0.2179).
Reduce the parameter size of the above model. acc=0.379.
Less overfitting, and the score is stable, but worse.
Increase the dropout rate from 0.1 to 0.5 in the above model. acc=0.377.
Overfitting is not improved.
Increase the dropout rate to 0.9 in the above model (cnnrnn2-small). acc=0.355. Learning becomes more gradual, but converges to the same score. Overfitting is not fixed by dropout.
This cell is moved to #3.
From the discussion, I guess it might be better to compress each stroke into a more compact representation. So I create a feature extractor from the single-symbol subtask dataset.
https://github.com/karino2/tegashiki/blob/master/tegashiki_symbol.ipynb
Basic design note:
acc: 0.36
No Improvement.
I couldn't believe the previous result. It must be a bug!
The previous dataset had different stroke counts for the training set and the validation set. I guess this might cause some mismatch in the loss calculation (I wonder if the attention code might not handle the stroke mask correctly).
acc 0.34
No improvement. It seems it was not a bug.
No matter what I try, the accuracy tends to converge to around 0.3. The training error goes much lower, but the validation loss doesn't improve.
Why? One guess is that the decoder combination is too complex and there is not much data, so the model can't learn the structure. The best it can do might be to just predict the prior distribution of math expression tokens.
To confirm my hypothesis, I create a model that just ignores the input strokes.
Just ignore the input strokes and train only on the decoder input-output. This is a form of math expression language model. It is the dumbest possible model.
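A minimal sketch of such a stroke-ignoring, decoder-only model (vocabulary and layer sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = 200  # hypothetical token vocabulary size

# Strokes are ignored entirely: predict each next token from previous tokens.
tokens_in = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(input_dim=VOCAB, output_dim=64)(tokens_in)
x = layers.GRU(128, return_sequences=True)(x)
next_token = layers.Dense(VOCAB, activation="softmax")(x)
lm = tf.keras.Model(tokens_in, next_token)
```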
acc 0.38
Getting better!
All attempts so far seem to have effectively ignored the stroke input and just predicted the general symbol distribution of math expressions!
The previous experiment suggests the current dataset might be too complex. So I filter out samples whose tex symbol length is too large and keep only those with fewer than 10. The dataset shrinks to only 3000 samples. This seems too small to learn from, but I try anyway.
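The filtering step is conceptually just this (the sample structure and key name are hypothetical):

```python
def keep_short(samples, max_len=10):
    """Keep samples whose tex token sequence is shorter than max_len.

    `samples` is assumed to be a list of dicts with a 'tex' token list.
    """
    return [s for s in samples if len(s["tex"]) < max_len]
```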
acc 0.31
No Improvement. It seems this task is impossible to solve!
Go to #3
Notes on each model.