hrvg / wateReview

Computational literature review of water resources research in Latin America and the Caribbean.
https://hrvg.github.io/wateReview

Run BERT for multilabel classification for temporal scales #10

Closed hrvg closed 5 years ago

hrvg commented 5 years ago

Modify the code for the Kaggle Toxic Comment competition to tailor it to our problem.
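For reference, the core change when adapting the Toxic Comment code is moving from a single softmax over classes to independent per-label sigmoids. A minimal sketch of that loss computation (TensorFlow 1.x; variable names are illustrative, not taken from the actual code):

import tensorflow as tf

def multilabel_head(pooled_output, labels, num_labels=9):
    """Sketch of a multilabel head on BERT's pooled [CLS] output.
    pooled_output: [batch, hidden] float tensor from BERT;
    labels: [batch, num_labels] multi-hot float matrix."""
    logits = tf.layers.dense(pooled_output, num_labels)  # one logit per label
    probabilities = tf.nn.sigmoid(logits)                # labels are not mutually exclusive
    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)                    # replaces softmax cross-entropy
    return tf.reduce_mean(per_example_loss), probabilities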

hrvg commented 5 years ago

make_corpus_for_BERT creates the corpus needed for BERT. The number of classes has been updated in the model architecture. Changes have been committed locally to avoid cluttering the cache. The model is running.
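make_corpus_for_BERT itself is not shown in this thread; a hypothetical sketch of its job, producing the one-text-column-plus-binary-label-columns layout the Toxic Comment scripts expect (only four of the nine label names appear later in this thread, so the list is truncated):

import pandas as pd

# Hypothetical reconstruction: one row per document, one binary column
# per temporal-scale label. Only event, week, years_10000 and
# 100000_years are named in this thread; the other five are elided.
LABELS = ["event", "week", "years_10000", "100000_years"]  # + five more

def make_corpus_for_BERT(docs, tags, path="bert_corpus.tsv"):
    """docs: list of document texts; tags: list of sets of label names."""
    rows = [{"text": doc, **{lab: int(lab in tag) for lab in LABELS}}
            for doc, tag in zip(docs, tags)]
    pd.DataFrame(rows).to_csv(path, sep="\t", index=False)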

hrvg commented 5 years ago

Training took time 0:21:49.638697

hrvg commented 5 years ago
{'0': 0.67785907,
 '1': 0.47482997,
 '2': 0.7207793,
 '3': 0.57806325,
 '4': 0.49122807,
 '5': 0.50000006,
 '6': 0.44230825,
 '7': 1.0,
 '8': 0.62179524,
 'eval_loss': 0.43003538,
 'global_step': 44,
 'loss': 0.42911974}

The first nine values are per-label Area Under Curve (AUC) scores. Results are satisfactory for the labels event and 100000_years, good for the label week, and excellent for years_10000.
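For context, each of the nine values is an AUC computed one label at a time (assuming ROC AUC); with scikit-learn this kind of table can be reproduced as follows (the arrays are synthetic stand-ins for the evaluation outputs):

import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: [n_docs, 9] multi-hot ground truth; y_prob: [n_docs, 9]
# per-label sigmoid probabilities. Synthetic data for illustration.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=(200, 9))
y_prob = rng.rand(200, 9)

aucs = {str(i): roc_auc_score(y_true[:, i], y_prob[:, i]) for i in range(9)}
print(aucs)  # same shape as the result dictionaries in this thread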

Next steps: see https://github.com/google-research/bert

hrvg commented 5 years ago

Re-running the whole model with:

In [40]: num_train_steps
Out[40]: 235
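In the reference run_classifier.py from google-research/bert, num_train_steps is derived from the corpus size, batch size, and epoch count; the example sizes below are assumptions picked to reproduce 235 and are not stated anywhere in this thread:

# num_train_steps = int(len(train_examples) / train_batch_size * num_train_epochs)
n_examples = 2507        # assumed for illustration only
train_batch_size = 32    # BERT default
num_train_epochs = 3.0   # BERT default
num_train_steps = int(n_examples / train_batch_size * num_train_epochs)
print(num_train_steps)   # 235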
hrvg commented 5 years ago
{'0': 0.63331115,
 '1': 0.49897963,
 '2': 0.6785716,
 '3': 0.5395257,
 '4': 0.68700165,
 '5': 0.6190477,
 '6': 0.66025674,
 '7': 1.0,
 '8': 0.41025698,
 'eval_loss': 0.44035244,
 'global_step': 235,
 'loss': 0.43410876}
hrvg commented 5 years ago

Re-running with three labels: short_term, long_term and very_long_term

hrvg commented 5 years ago
{'0': 0.50892854,
 '1': 0.5717241,
 '2': 0.55844176,
 'eval_loss': 0.56304604,
 'global_step': 235,
 'loss': 0.55524087}

Interestingly, results are worse with the combined labels. Combining the labels likely amplifies the noise from the human reading process. One solution would be to train with the nine labels and then combine them a posteriori by assessing the overall probability of being short term, long term, or very long term.
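A sketch of that a posteriori combination, assuming independent per-label sigmoid probabilities; the grouping of the nine label indices into the three coarse classes is hypothetical, since the actual assignment is not spelled out in this thread:

import numpy as np

# Hypothetical grouping of the nine fine-grained label indices.
GROUPS = {"short_term": [0, 1, 2],
          "long_term": [3, 4, 5],
          "very_long_term": [6, 7, 8]}

def combine_a_posteriori(p):
    """p: [n_docs, 9] per-label sigmoid probabilities.
    Returns P(at least one member label) per group, a noisy-OR under
    independence; a per-group max would be a simpler alternative."""
    return {group: 1.0 - np.prod(1.0 - p[:, idx], axis=1)
            for group, idx in GROUPS.items()}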

Now re-running with the nine labels, sequences of maximum length 512, and an increased number of epochs (10).
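Since 512 WordPiece tokens is the maximum the released BERT checkpoints support, anything longer is truncated; a quick way to check how much of the corpus actually fits, assuming the bert-tensorflow tokenization module and a placeholder vocabulary path:

from bert import tokenization  # pip package bert-tensorflow

# vocab.txt ships with the pretrained checkpoint; the path is a placeholder.
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

def truncated_share(docs, max_seq_length=512):
    """Fraction of documents longer than the usable budget
    (max_seq_length minus the [CLS] and [SEP] special tokens)."""
    budget = max_seq_length - 2
    return sum(len(tokenizer.tokenize(d)) > budget for d in docs) / len(docs)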

hrvg commented 5 years ago
{'0': 0.5821144,
 '1': 0.4802721,
 '2': 0.477273,
 '3': 0.551054,
 '4': 0.6172248,
 '5': 0.37599218,
 '6': 1.0,
 '7': 1.0,
 '8': 0.9871795,
 'eval_loss': 0.44625485,
 'global_step': 1178,
 'loss': 0.4452529}
hrvg commented 5 years ago

Re-training with 128 tokens and a higher number of training epochs (100).

hrvg commented 5 years ago
{'0': 0.61003995,
 '1': 0.48061225,
 '2': 0.80519485,
 '3': 0.5395257,
 '4': 0.6216108,
 '5': 0.4305556,
 '6': 1.0,
 '7': 1.0,
 '8': 0.974359,
 'eval_loss': 0.86895615,
 'global_step': 2209,
 'loss': 0.85478354}

The doubling of the loss as the number of training epochs increases implies that the learning rate is too small, causing instabilities in the gradient. I think I will work on getting the abstracts to provide better constraints on the learning model.
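For reference, BERT's create_optimizer applies a linear warmup followed by a linear decay to zero over num_train_steps; a plain-Python sketch of that schedule, useful for checking the effective learning rate late in a long run (the parameter defaults below are the usual BERT settings, assumed rather than read from this run):

def bert_learning_rate(step, init_lr=5e-5, num_train_steps=2209,
                       warmup_proportion=0.1):
    """Linear warmup, then linear decay to zero, as in BERT's
    optimization.py (polynomial decay with power 1)."""
    warmup_steps = int(num_train_steps * warmup_proportion)
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    return init_lr * (1.0 - step / float(num_train_steps))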

hrvg commented 5 years ago

Design a BERT run with the abstracts, with increasing lengths of token sequences, a lower learning rate, and a higher number of epochs, and implement accuracy metrics.
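On the accuracy-metrics point, scikit-learn covers the usual multilabel variants; a small sketch with synthetic stand-ins for the model outputs:

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# y_true and y_pred are [n_docs, n_labels] binary matrices; predictions
# come from thresholding the sigmoid probabilities (0.5 here, assumed).
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=(200, 9))
y_pred = (rng.rand(200, 9) > 0.5).astype(int)

print("subset accuracy:", accuracy_score(y_true, y_pred))  # exact-match ratio
print("Hamming loss:", hamming_loss(y_true, y_pred))       # per-label error rate
print("micro F1:", f1_score(y_true, y_pred, average="micro"))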

hrvg commented 5 years ago

Closing this issue, as the development of a BERT-based solution for temporal scales does not seem to be the most relevant one.

Linked to #13