Closed: larrylawl closed this issue 3 years ago
Hi
For the tokenization, we need word-level tokens since our rationale labels are word-level. Internally, the model generates word-level embeddings from wordpiece-level embeddings by summing.
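For concreteness, here is a minimal PyTorch sketch of that summing step (the function name, shapes, and the word-id mapping are purely illustrative, not the code in this repo):

```python
import torch

def wordpiece_to_word_embeddings(wp_embeddings: torch.Tensor,
                                 word_ids: torch.Tensor) -> torch.Tensor:
    """Sum the wordpiece embeddings that belong to the same original word.

    wp_embeddings: (num_wordpieces, hidden_dim) BERT output vectors
    word_ids:      (num_wordpieces,) index of the source word for each wordpiece
    """
    num_words = int(word_ids.max().item()) + 1
    word_embeddings = torch.zeros(num_words, wp_embeddings.size(1))
    # Scatter-add each wordpiece vector into the slot of its source word.
    word_embeddings.index_add_(0, word_ids, wp_embeddings)
    return word_embeddings

# e.g. "unaffordable" -> ["una", "##fford", "##able"]: three wordpieces, one word
wp = torch.randn(5, 768)                  # 5 wordpieces, BERT-base hidden size
word_ids = torch.tensor([0, 1, 2, 2, 2])  # the last three pieces form word 2
print(wordpiece_to_word_embeddings(wp, word_ids).shape)  # torch.Size([3, 768])
```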
During training, instead of training the whole model, we train only the top two layers plus the pooler. This reduces our training time.
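A rough sketch of what that freezing scheme looks like, written against the `transformers` `BertModel` purely for illustration (the repo itself drives this through its AllenNLP config, and the layer indices assume BERT-base):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Unfreeze only the top-2 encoder layers (10 and 11 in BERT-base) and the pooler;
# everything else stays frozen.
TRAINABLE_PREFIXES = ("encoder.layer.10.", "encoder.layer.11.", "pooler.")

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(TRAINABLE_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```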
Got it. Thanks for taking the time to explain!
Hi, thanks for releasing this code! Can I clarify two things:
1. Why is the tokenisation step done as
https://github.com/successar/Eraser-Benchmark-Baseline-Models/blob/894bfba09e8966aec9b046ddc595d434504a4f90/Rationale_model/data/dataset_readers/rationale_reader.py#L89
instead of using a BERT wordpiece tokeniser? As an example, see lines 6-9 of the config file in this AllenNLP guide:
2. Why do we require gradients for the encoder's embeddings?
https://github.com/successar/Eraser-Benchmark-Baseline-Models/blob/894bfba09e8966aec9b046ddc595d434504a4f90/Rationale_model/training_config/classifiers/bert_encoder_generator.jsonnet#L72
Thanks again for this code base and for releasing the ERASER dataset!