stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

PoS tagging #26

Closed yigit353 closed 3 years ago

yigit353 commented 3 years ago

I managed to run the NER example on custom data using run_ner.py from transformers. After JSON formatting, the data looks like this:

{"tokens":["Yıldız","Savaşları",":","Bölüm","II","-","Klonların","Saldırısı","''"], "ner_tags":["B-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","I-ORG","O"]}
{"tokens":["1998-2004",":","Kombassan","Holding"], "ner_tags":["O","O","B-ORG","I-ORG"]}
{"tokens":["Avustralya'da","25","numaraya","çıkmış",",","ayrıca","Yeni","Zelanda","listesine","32","numaradan","giriş","yapmış","ve","8","numaraya","çıkmıştır","."], "ner_tags":["B-LOC","O","O","O","O","O","B-LOC","I-LOC","O","O","O","O","O","O","O","O","O","O"]}
{"tokens":["Piet","Mondrian","(","1872-1944",")"], "ner_tags":["B-PER","I-PER","O","O","O"]}

However, for PoS tagging the IMST data looks like this:

16  ,   ,   PUNCT   Punc    _   24  punct   _   _
17  yetmiş  yetmiş  NUM ANum    NumType=Card    18  nummod  _   _
18  yaşlarında  yaş ADJ NAdj    Case=Loc|Number=Plur|Number[psor]=Sing|Person=3|Person[psor]=3  23  amod    _   _
19  şık şık ADJ Adj _   20  amod    _   _
20-21   giyimli _   _   _   _   _   _   _   _
20  giyim   giyim   NOUN    Noun    Case=Nom|Number=Sing|Person=3   23  obl _   _
21  li  li  ADP With    _   20  case    _   _

A word such as giyimli might have multiple PoS tags. In the NER data, however, each word matches exactly one tag. So how can we create a JSON file for PoS tagging that run_ner.py can parse correctly?

Thank you!

stefan-it commented 3 years ago

Hi @yigit353 ,

very good question! In Flair, for example, we simply filter out these kinds of lines (20-21). So instead of "giyimli" (which has no PoS tag in the dataset), we would use giyim and li with NOUN and ADP in training :)
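
The filtering described above can be sketched in a few lines of Python. This is only an illustration, not the Flair implementation: the function name `conllu_to_json_lines` and the `label_key` parameter are hypothetical, the input is assumed to be tab-separated CoNLL-U, and the tag is taken from the fourth column (the one holding NOUN and ADP in the example). The default key `ner_tags` simply mirrors the NER data shown earlier; whether run_ner.py needs a different key for PoS tagging is not confirmed in this thread.

```python
import json


def conllu_to_json_lines(conllu_text, label_key="ner_tags"):
    """Turn CoNLL-U text into JSON lines, one sentence per line.

    Hypothetical helper: range lines such as "20-21" are dropped, so
    "giyimli" becomes the two tokens "giyim" and "li".
    """
    json_lines = []
    tokens, tags = [], []

    def flush():
        # Emit the sentence collected so far, if there is one.
        if tokens:
            record = {"tokens": list(tokens), label_key: list(tags)}
            json_lines.append(json.dumps(record, ensure_ascii=False))
            tokens.clear()
            tags.clear()

    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            flush()  # a blank line ends a sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not a token line
        token_id = cols[0]
        # Skip multiword ranges ("20-21") and empty nodes ("8.1").
        if "-" in token_id or "." in token_id:
            continue
        tokens.append(cols[1])  # surface form
        tags.append(cols[3])    # universal PoS tag
    flush()  # last sentence may lack a trailing blank line
    return json_lines


if __name__ == "__main__":
    sample = (
        "19\tşık\tşık\tADJ\tAdj\t_\t20\tamod\t_\t_\n"
        "20-21\tgiyimli\t_\t_\t_\t_\t_\t_\t_\t_\n"
        "20\tgiyim\tgiyim\tNOUN\tNoun\tCase=Nom\t23\tobl\t_\t_\n"
        "21\tli\tli\tADP\tWith\t_\t20\tcase\t_\t_\n"
    )
    for json_line in conllu_to_json_lines(sample):
        print(json_line)
```

Each returned string can be written as one line of the training, dev, or test file, in the same shape as the NER examples in the first comment.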

yigit353 commented 3 years ago

Hi @stefan-it, thank you for the swift response. I just got results using dbmdz/bert-base-turkish-cased and the IMST dataset.

Dev performance:

eval_loss = 0.11717678606510162
eval_precision = 0.9535936615732881
eval_recall = 0.9558656682550488
eval_f1 = 0.9547283132188793
eval_accuracy = 0.967350189130002
eval_runtime = 3.5767
eval_samples_per_second = 276.232
epoch = 3.0

Training performance:

epoch = 3.0
train_runtime = 94.4194
train_samples_per_second = 4.861

Not exactly the result reported in the repo:

BERTurk (32k) (0.9701) / 0.9712

Is that expected?

yigit353 commented 3 years ago

After setting the same training hyperparameters:

Parameter Value
batch_size 16
learning_rate 5e-5
num_epochs 10

Test eval results (setting test set as validation_file):

02/26/2021 23:05:27 - INFO - main - Eval results
02/26/2021 23:05:27 - INFO - main - eval_loss = 0.14761877059936523
02/26/2021 23:05:27 - INFO - main - eval_precision = 0.9570830030574113
02/26/2021 23:05:27 - INFO - main - eval_recall = 0.9592554761094086
02/26/2021 23:05:27 - INFO - main - eval_f1 = 0.9581680081623398
02/26/2021 23:05:27 - INFO - main - eval_accuracy = 0.9693887725595772
02/26/2021 23:05:27 - INFO - main - eval_runtime = 3.4074
02/26/2021 23:05:27 - INFO - main - eval_samples_per_second = 288.49
02/26/2021 23:05:27 - INFO - main - epoch = 10.0

Dev eval results (setting dev set as validation_file):

02/26/2021 22:57:40 - INFO - main - Eval results
02/26/2021 22:57:40 - INFO - main - eval_loss = 0.14820729196071625
02/26/2021 22:57:40 - INFO - main - eval_precision = 0.9599275526375368
02/26/2021 22:57:40 - INFO - main - eval_recall = 0.9621057408668028
02/26/2021 22:57:40 - INFO - main - eval_f1 = 0.9610154125113327
02/26/2021 22:57:40 - INFO - main - eval_accuracy = 0.9724268365518615
02/26/2021 22:57:40 - INFO - main - eval_runtime = 3.505
02/26/2021 22:57:40 - INFO - main - eval_samples_per_second = 281.88
02/26/2021 22:57:40 - INFO - main - epoch = 10.0

Reported results in the repo:

BERTurk (32k) (0.9701) / 0.971

My results:

BERTurk (32k) (0.972) / 0.969

Close enough :)