successar / Eraser-Benchmark-Baseline-Models

Baseline for ERASER benchmark
https://www.eraserbenchmark.com

Questions on tokenisation and computing gradients for BERT #4

Closed: larrylawl closed this issue 3 years ago

larrylawl commented 3 years ago

Hi, thanks for releasing this code! Can I clarify:

1. Why is the tokenisation step done as

https://github.com/successar/Eraser-Benchmark-Baseline-Models/blob/894bfba09e8966aec9b046ddc595d434504a4f90/Rationale_model/data/dataset_readers/rationale_reader.py#L89

instead of using a BERT wordpiece tokeniser? As an example, see lines 6-9 of the config file in this AllenNLP guide:

"tokenizer": {
            "type": "pretrained_transformer",
            "model_name": bert_model,
        },
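For context, here is a rough sketch of what I had expected (illustration only, using HuggingFace transformers rather than anything from this repo): a wordpiece tokeniser splits rare words into several pieces, so the token sequence no longer lines up one-to-one with word-level rationale labels.

```python
# Illustration only (not code from this repo): wordpiece tokenisation breaks the
# one-to-one alignment between tokens and word-level rationale labels.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

words = ["The", "patient", "was", "tachycardic"]  # word-level tokens
rationale_labels = [0, 0, 0, 1]                   # one label per word

wordpieces = tokenizer.tokenize(" ".join(words))
print(wordpieces)
# The last word is split into several '##' pieces, so len(wordpieces) > len(rationale_labels)
# and the labels would need re-alignment.
```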

2. Why do we require gradients for the encoder's embeddings?

https://github.com/successar/Eraser-Benchmark-Baseline-Models/blob/894bfba09e8966aec9b046ddc595d434504a4f90/Rationale_model/training_config/classifiers/bert_encoder_generator.jsonnet#L72

Thanks once again for this code base and for releasing the ERASER dataset!

successar commented 3 years ago

Hi

For the tokenization, we need word-level tokens since our rationale labels are word-level. Internally, the model generates word-level embeddings from wordpiece-level ones by summing.
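A minimal sketch of that summing step (illustration only, not the repo's exact code), assuming you have wordpiece embeddings and a mapping from each wordpiece to the word it belongs to:

```python
import torch

# Hypothetical inputs: 7 wordpiece embeddings for a 4-word sentence, where the
# last word was split into 4 pieces.
wordpiece_embeddings = torch.randn(7, 768)            # [num_wordpieces, hidden_dim]
piece_to_word = torch.tensor([0, 1, 2, 3, 3, 3, 3])   # word index for each wordpiece

num_words = int(piece_to_word.max()) + 1
word_embeddings = torch.zeros(num_words, wordpiece_embeddings.size(-1))
word_embeddings.index_add_(0, piece_to_word, wordpiece_embeddings)  # sum pieces per word

print(word_embeddings.shape)  # torch.Size([4, 768])
```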

During training, instead of training the whole model, we train only the top two layers plus the pooler. This reduces our training time.
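In plain PyTorch terms, the idea is something like the sketch below (illustration only; the repo configures this through the AllenNLP jsonnet linked above, and "top two layers" here assumes the last two encoder layers of bert-base):

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

for name, param in bert.named_parameters():
    # Only the last two encoder layers and the pooler stay trainable.
    param.requires_grad = name.startswith(
        ("encoder.layer.10.", "encoder.layer.11.", "pooler.")
    )

optimizer = torch.optim.Adam(
    [p for p in bert.parameters() if p.requires_grad], lr=2e-5
)
```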

larrylawl commented 3 years ago

Got it. Thanks for taking the time to explain!