Hi @yigit353,
very good question! In Flair, for example, we simply filter out these kinds of lines (20-21). So instead of "giyimli" (which has no PoS tag in the dataset), we would use "giyim" and "li" with NOUN and ADP in training :)
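The filtering described above could be sketched roughly as follows. This is a minimal illustration, not Flair's actual code: it assumes CoNLL-U input, skips any line whose ID is a range such as 20-21 (or an empty node such as 2.1), and emits one JSON object per sentence. The function name `conllu_to_examples`, the helper `row`, the field names `tokens` and `pos_tags`, and the toy sentence are all assumptions made for this example; the fragment is hand-written and not copied from IMST.

```python
import json


def conllu_to_examples(conllu_text):
    """Turn CoNLL-U text into one {"tokens", "pos_tags"} dict per sentence."""
    examples, tokens, tags = [], [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line marks a sentence boundary.
            if tokens:
                examples.append({"tokens": tokens, "pos_tags": tags})
                tokens, tags = [], []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword range (e.g. 2-3) or empty node (e.g. 2.1)
        tokens.append(cols[1])  # FORM column
        tags.append(cols[3])    # UPOS column
    if tokens:
        # Flush the last sentence if the text does not end with a blank line.
        examples.append({"tokens": tokens, "pos_tags": tags})
    return examples


def row(tok_id, form, lemma, upos):
    """Build a 10-column CoNLL-U row; unused columns are filled with "_"."""
    return "\t".join([tok_id, form, lemma, upos] + ["_"] * 6)


# Hand-written toy fragment (illustrative tags only, not taken from IMST).
sample = "\n".join([
    "# text = iyi giyimli adam",
    row("1", "iyi", "iyi", "ADJ"),
    row("2-3", "giyimli", "_", "_"),   # range line without a PoS tag: dropped
    row("2", "giyim", "giyim", "NOUN"),
    row("3", "li", "li", "ADP"),
    row("4", "adam", "adam", "NOUN"),
    "",
])

examples = conllu_to_examples(sample)
# One JSON object per line, ready to be written to a .json file.
json_lines = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
print(json_lines)
```

Each sentence then carries exactly one tag per token, which is the shape the NER example expects. If your version of run_ner.py supports `--text_column_name` and `--label_column_name`, those can be pointed at whatever field names you choose.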
Hi @stefan-it,
Thank you for the swift response. I just got results using dbmdz/bert-base-turkish-cased and the IMST dataset.
Dev performance:
eval_loss = 0.11717678606510162
eval_precision = 0.9535936615732881
eval_recall = 0.9558656682550488
eval_f1 = 0.9547283132188793
eval_accuracy = 0.967350189130002
eval_runtime = 3.5767
eval_samples_per_second = 276.232
epoch = 3.0
Training performance:
epoch = 3.0
train_runtime = 94.4194
train_samples_per_second = 4.861
These do not exactly match the results reported in the repo:
Model | (Dev) / Test accuracy |
---|---|
BERTurk (32k) | (0.9701) / 0.9712 |
Is that expected?
After setting the same training hyperparameters:
Parameter | Value |
---|---|
batch_size | 16 |
learning_rate | 5e-5 |
num_epochs | 10 |
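For anyone wanting to reproduce this, the hyperparameters above might be passed to run_ner.py roughly like this. It is only a sketch: the file names `train.json` and `dev.json` and the output directory `out` are placeholders, `--task_name pos` is an assumed choice, and mapping batch_size to `--per_device_train_batch_size` assumes a single device.

```python
import shlex

# Hyperparameters from the table above, mapped to Trainer-style flag names.
# batch_size -> per_device_train_batch_size assumes training on one device.
hyperparams = {
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
    "num_train_epochs": 10,
}

cmd = [
    "python", "run_ner.py",
    "--model_name_or_path", "dbmdz/bert-base-turkish-cased",
    "--task_name", "pos",              # assumed task name
    "--train_file", "train.json",      # placeholder path
    "--validation_file", "dev.json",   # placeholder path
    "--output_dir", "out",             # placeholder path
    "--do_train",
    "--do_eval",
]
for name, value in hyperparams.items():
    cmd += [f"--{name}", str(value)]

# Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Swapping the path given to `--validation_file` for the test set gives the test-set evaluation described in this comment.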
Test eval results (setting test set as validation_file):
02/26/2021 23:05:27 - INFO - main - Eval results
02/26/2021 23:05:27 - INFO - main - eval_loss = 0.14761877059936523
02/26/2021 23:05:27 - INFO - main - eval_precision = 0.9570830030574113
02/26/2021 23:05:27 - INFO - main - eval_recall = 0.9592554761094086
02/26/2021 23:05:27 - INFO - main - eval_f1 = 0.9581680081623398
02/26/2021 23:05:27 - INFO - main - eval_accuracy = 0.9693887725595772
02/26/2021 23:05:27 - INFO - main - eval_runtime = 3.4074
02/26/2021 23:05:27 - INFO - main - eval_samples_per_second = 288.49
02/26/2021 23:05:27 - INFO - main - epoch = 10.0
Dev eval results (setting dev set as validation_file):
02/26/2021 22:57:40 - INFO - main - Eval results
02/26/2021 22:57:40 - INFO - main - eval_loss = 0.14820729196071625
02/26/2021 22:57:40 - INFO - main - eval_precision = 0.9599275526375368
02/26/2021 22:57:40 - INFO - main - eval_recall = 0.9621057408668028
02/26/2021 22:57:40 - INFO - main - eval_f1 = 0.9610154125113327
02/26/2021 22:57:40 - INFO - main - eval_accuracy = 0.9724268365518615
02/26/2021 22:57:40 - INFO - main - eval_runtime = 3.505
02/26/2021 22:57:40 - INFO - main - eval_samples_per_second = 281.88
02/26/2021 22:57:40 - INFO - main - epoch = 10.0
Reported results in the repo:
Model | (Dev) / Test accuracy |
---|---|
BERTurk (32k) | (0.9701) / 0.971 |
My results:
Model | (Dev) / Test accuracy |
---|---|
BERTurk (32k) | (0.972) / 0.969 |
Close enough :)
I managed to run the NER example with the custom data using run_ner.py from transformers. The data looks like below after JSON formatting. However, for PoS tagging the IMST data looks like this:
A word such as "giyimli" might have multiple PoS tags. However, in the case of NER, one word only matches one NER tag. So, how can we create a JSON file that can be correctly parsed by run_ner.py for PoS tagging? Thank you!