Closed by hrvg 5 years ago
`make_corpus_for_BERT` creates the corpus needed for BERT.
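A minimal sketch of what such a corpus-building step might look like; the function name, record shape, and TSV column layout are assumptions for illustration, not the project's actual code:

```python
import csv, tempfile, os

def make_corpus_for_bert(records, out_path):
    """Write (text, label) pairs as TSV in the id / label / text layout
    commonly expected by BERT fine-tuning scripts.
    The exact column layout is an assumption, not the project's code."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for i, (text, label) in enumerate(records):
            # BERT's tokenizer takes raw text; just normalize whitespace here
            writer.writerow([i, label, " ".join(text.split())])

# Example usage with a hypothetical record and a temporary file
path = os.path.join(tempfile.mkdtemp(), "train.tsv")
make_corpus_for_bert([("A sample abstract.", "week")], path)
```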
The number of classes has been updated in model architecture.
Changes have been committed locally to avoid cluttering cache.
Model is running.
Training took 0:21:49.638697.
```
{'0': 0.67785907,
 '1': 0.47482997,
 '2': 0.7207793,
 '3': 0.57806325,
 '4': 0.49122807,
 '5': 0.50000006,
 '6': 0.44230825,
 '7': 1.0,
 '8': 0.62179524,
 'eval_loss': 0.43003538,
 'global_step': 44,
 'loss': 0.42911974}
```
The first nine values are Area Under the Curve (AUC) scores. Results are satisfactory for labels `event` and `100000_years`, good for label `week`, and excellent for `years_10000`.
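For reference, each of those values is a one-vs-rest ROC AUC for a single label. A dependency-free sketch of the computation (the example labels and scores are made up):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen negative.
    Ties count as half. labels are 0/1, scores are floats."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives 1.0 (as for label '7' above); chance level is 0.5
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```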
Next steps include tuning:
- `NUM_TRAIN_EPOCHS`, `BATCH_SIZE`
- `MAX_SEQ_LENGTH`, `BATCH_SIZE`
Re-running the whole model with:
- `MAX_SEQ_LENGTH = 512`
- `BATCH_SIZE = 6`
```
In [40]: num_train_steps
Out[40]: 235
```
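In Google's reference `run_classifier.py`, `num_train_steps` is derived from the dataset size, batch size and epoch count; a sketch of that formula (the example numbers below are hypothetical, since the actual training-set size is not shown in these logs):

```python
def num_train_steps(n_examples, batch_size, n_epochs):
    # Mirrors the formula used in BERT's run_classifier.py:
    # int(len(train_examples) / batch_size * num_train_epochs)
    return int(n_examples / batch_size * n_epochs)

# Hypothetical numbers, for illustration only
print(num_train_steps(1000, 8, 3))  # → 375
```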
```
{'0': 0.63331115,
 '1': 0.49897963,
 '2': 0.6785716,
 '3': 0.5395257,
 '4': 0.68700165,
 '5': 0.6190477,
 '6': 0.66025674,
 '7': 1.0,
 '8': 0.41025698,
 'eval_loss': 0.44035244,
 'global_step': 235,
 'loss': 0.43410876}
```
Re-running with three labels: `short_term`, `long_term` and `very_long_term`.
```
{'0': 0.50892854,
 '1': 0.5717241,
 '2': 0.55844176,
 'eval_loss': 0.56304604,
 'global_step': 235,
 'loss': 0.55524087}
```
Interestingly, results are worse with the combined labels. Combining the labels likely amplifies the noise from the human reading process. One solution would be to train with the nine labels and then combine them a posteriori, by assessing the overall probability of being short term, long term and very long term.
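The a-posteriori combination could look like the following sketch. The grouping of the nine fine-grained labels into the three coarse classes is a hypothetical assignment for illustration:

```python
# Hypothetical grouping of the nine temporal labels into three coarse classes;
# the actual memberships would come from the project's label definitions.
GROUPS = {
    "short_term": ["event", "week", "years"],
    "long_term": ["years_100", "years_1000"],
    "very_long_term": ["years_10000", "100000_years"],
}

def coarse_probs(fine_probs, groups=GROUPS):
    """Combine per-label probabilities (independent sigmoids, multi-label)
    into a coarse score: the probability that at least one label in the
    group fires, i.e. 1 - prod(1 - p)."""
    out = {}
    for coarse, members in groups.items():
        none_fire = 1.0
        for m in members:
            none_fire *= 1.0 - fine_probs.get(m, 0.0)
        out[coarse] = 1.0 - none_fire
    return out
```

This avoids retraining on the merged labels, so the fine-grained signal is kept and the merge happens only at evaluation time.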
Now re-running with the nine labels, a maximum sequence length of 512 and an increased number of epochs (10).
```
{'0': 0.5821144,
 '1': 0.4802721,
 '2': 0.477273,
 '3': 0.551054,
 '4': 0.6172248,
 '5': 0.37599218,
 '6': 1.0,
 '7': 1.0,
 '8': 0.9871795,
 'eval_loss': 0.44625485,
 'global_step': 1178,
 'loss': 0.4452529}
```
Re-training with 128 tokens and a higher number of training epochs (100).
```
{'0': 0.61003995,
 '1': 0.48061225,
 '2': 0.80519485,
 '3': 0.5395257,
 '4': 0.6216108,
 '5': 0.4305556,
 '6': 1.0,
 '7': 1.0,
 '8': 0.974359,
 'eval_loss': 0.86895615,
 'global_step': 2209,
 'loss': 0.85478354}
```
The doubling of the loss as the number of training epochs increases suggests that the learning rate is mis-tuned, causing instabilities in the gradient updates. I think I will work on using the abstracts to provide better constraints on the learning model.
Design a BERT run on the abstracts with longer token sequences, a lower learning rate and a higher number of epochs, and implement accuracy metrics.
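A per-label accuracy metric to complement AUC could be as simple as thresholding the sigmoid outputs; the 0.5 threshold below is an assumption, not a tuned value:

```python
def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """y_true: list of 0/1 label vectors, y_prob: matching probability vectors.
    Returns the accuracy of each label position after thresholding at 0.5
    (an assumed default; a tuned per-label threshold may work better)."""
    n_labels = len(y_true[0])
    correct = [0] * n_labels
    for truth, probs in zip(y_true, y_prob):
        for j in range(n_labels):
            correct[j] += int((probs[j] >= threshold) == bool(truth[j]))
    return [c / len(y_true) for c in correct]

print(per_label_accuracy([[1, 0], [0, 1]], [[0.9, 0.2], [0.4, 0.7]]))  # → [1.0, 1.0]
```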
Closing this issue, as developing a BERT-based solution for temporal scales does not seem to be the most relevant approach.
Linked to #13
Modify the code for the Kaggle Toxic Comment competition to tailor it to our problem.