Closed by hrvg 5 years ago
`make_corpus_for_BERT` creates the corpus needed for BERT.
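A minimal sketch of what such a corpus-building step might look like; the function name, record shape, and TSV column layout are assumptions for illustration, not the project's actual code:

```python
import csv, tempfile, os

def make_corpus_for_bert(records, out_path):
    """Write (text, label) pairs as TSV in the id / label / text layout
    commonly expected by BERT fine-tuning scripts.
    The exact column layout is an assumption, not the project's code."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for i, (text, label) in enumerate(records):
            # BERT's tokenizer takes raw text; just normalize whitespace here
            writer.writerow([i, label, " ".join(text.split())])

# Example usage with a hypothetical record and a temporary file
path = os.path.join(tempfile.mkdtemp(), "train.tsv")
make_corpus_for_bert([("A sample abstract.", "week")], path)
```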
The number of classes has been updated in model architecture.
Changes have been committed locally to avoid cluttering cache.
Model is running.
Training took 0:21:49.638697.
```
{'0': 0.67785907,
 '1': 0.47482997,
 '2': 0.7207793,
 '3': 0.57806325,
 '4': 0.49122807,
 '5': 0.50000006,
 '6': 0.44230825,
 '7': 1.0,
 '8': 0.62179524,
 'eval_loss': 0.43003538,
 'global_step': 44,
 'loss': 0.42911974}
```
The first nine values are Area Under the Curve (AUC) scores. Results are satisfactory for labels `event` and `100000_years`, good for label `week`, and excellent for `years_10000`.
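For reference, each of those values is a one-vs-rest ROC AUC for a single label. A dependency-free sketch of the computation (the example labels and scores are made up):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen negative.
    Ties count as half. labels are 0/1, scores are floats."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives 1.0 (as for label '7' above); chance level is 0.5
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```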
Next steps include tuning:
- `NUM_TRAIN_EPOCHS`, `BATCH_SIZE`
- `MAX_SEQ_LENGTH`, `BATCH_SIZE`
Re-running the whole model with:
- `MAX_SEQ_LENGTH = 512`
- `BATCH_SIZE = 6`
```
In [40]: num_train_steps
Out[40]: 235
```
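In Google's reference `run_classifier.py`, `num_train_steps` is derived from the dataset size, batch size and epoch count; a sketch of that formula (the example numbers below are hypothetical, since the actual training-set size is not shown in these logs):

```python
def num_train_steps(n_examples, batch_size, n_epochs):
    # Mirrors the formula used in BERT's run_classifier.py:
    # int(len(train_examples) / batch_size * num_train_epochs)
    return int(n_examples / batch_size * n_epochs)

# Hypothetical numbers, for illustration only
print(num_train_steps(1000, 8, 3))  # → 375
```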
```
{'0': 0.63331115,
 '1': 0.49897963,
 '2': 0.6785716,
 '3': 0.5395257,
 '4': 0.68700165,
 '5': 0.6190477,
 '6': 0.66025674,
 '7': 1.0,
 '8': 0.41025698,
 'eval_loss': 0.44035244,
 'global_step': 235,
 'loss': 0.43410876}
```
Re-running with three labels: `short_term`, `long_term` and `very_long_term`.
```
{'0': 0.50892854,
 '1': 0.5717241,
 '2': 0.55844176,
 'eval_loss': 0.56304604,
 'global_step': 235,
 'loss': 0.55524087}
```
Interestingly, results are worse with the combined labels. Combining the labels likely amplifies the noise from the human reading process. One solution would be to train with the nine labels and then combine them a posteriori, by assessing the overall probability of being short term, long term and very long term.
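The a-posteriori combination could look like the following sketch. The grouping of the nine fine-grained labels into the three coarse classes is a hypothetical assignment for illustration:

```python
# Hypothetical grouping of the nine temporal labels into three coarse classes;
# the actual memberships would come from the project's label definitions.
GROUPS = {
    "short_term": ["event", "week", "years"],
    "long_term": ["years_100", "years_1000"],
    "very_long_term": ["years_10000", "100000_years"],
}

def coarse_probs(fine_probs, groups=GROUPS):
    """Combine per-label probabilities (independent sigmoids, multi-label)
    into a coarse score: the probability that at least one label in the
    group fires, i.e. 1 - prod(1 - p)."""
    out = {}
    for coarse, members in groups.items():
        none_fire = 1.0
        for m in members:
            none_fire *= 1.0 - fine_probs.get(m, 0.0)
        out[coarse] = 1.0 - none_fire
    return out
```

This avoids retraining on the merged labels, so the fine-grained signal is kept and the merge happens only at evaluation time.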
Now re-running with the nine labels, a maximum sequence length of 512 and an increased number of epochs (10).
```
{'0': 0.5821144,
 '1': 0.4802721,
 '2': 0.477273,
 '3': 0.551054,
 '4': 0.6172248,
 '5': 0.37599218,
 '6': 1.0,
 '7': 1.0,
 '8': 0.9871795,
 'eval_loss': 0.44625485,
 'global_step': 1178,
 'loss': 0.4452529}
```
Re-training with 128 tokens and a higher number of training epochs (100).
```
{'0': 0.61003995,
 '1': 0.48061225,
 '2': 0.80519485,
 '3': 0.5395257,
 '4': 0.6216108,
 '5': 0.4305556,
 '6': 1.0,
 '7': 1.0,
 '8': 0.974359,
 'eval_loss': 0.86895615,
 'global_step': 2209,
 'loss': 0.85478354}
```
The doubling of the loss as the number of training epochs increases suggests that the learning rate is mis-tuned, causing instabilities in the gradient updates. I think I will work on using the abstracts to provide better constraints on the learning model.
Design a BERT run on the abstracts with longer token sequences, a lower learning rate and a higher number of epochs, and implement accuracy metrics.
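A per-label accuracy metric to complement AUC could be as simple as thresholding the sigmoid outputs; the 0.5 threshold below is an assumption, not a tuned value:

```python
def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """y_true: list of 0/1 label vectors, y_prob: matching probability vectors.
    Returns the accuracy of each label position after thresholding at 0.5
    (an assumed default; a tuned per-label threshold may work better)."""
    n_labels = len(y_true[0])
    correct = [0] * n_labels
    for truth, probs in zip(y_true, y_prob):
        for j in range(n_labels):
            correct[j] += int((probs[j] >= threshold) == bool(truth[j]))
    return [c / len(y_true) for c in correct]

print(per_label_accuracy([[1, 0], [0, 1]], [[0.9, 0.2], [0.4, 0.7]]))  # → [1.0, 1.0]
```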
Closing this issue, as developing a BERT-based solution for temporal scales does not seem to be the most relevant approach.
Linked to #13
Modify the code for the Kaggle Toxic Comment competition to tailor it to our problem.