ufal / bert-diacritics-restoration

Repository storing code and data for our paper "Diacritics Restoration using BERT with Analysis on Czech language".
Apache License 2.0

prediction not really working #3

Closed: jbellic closed this issue 3 years ago

jbellic commented 3 years ago

Hi,

I've spent some time training on a small cs dataset (downloaded from the given link) by following these steps:

(for the sake of time, all input/target train, dev and test files point to the same files)

INPUT_TRAIN_FILE="data/target_train_small_stripped.txt"
TARGET_TRAIN_FILE="data/target_train_small.txt"
INPUT_DEV_FILE="data/target_train_small_stripped.txt"
TARGET_DEV_FILE="data/target_train_small.txt"
INPUT_TEST_FILE="data/target_train_small_stripped.txt"
TARGET_TEST_FILE="data/target_train_small.txt"
LABELS="data/subwords.txt"
OUTPUT_DIR="checkpoints"
DATA_DIR="cache"
NUM_EPOCHS=1
BERT_MODEL=bert-base-multilingual-uncased
TOKENIZER_NAME=bert-base-multilingual-uncased
MAX_LENGTH=128
BATCH_SIZE=64
GRAD_ACC_STEPS=32
SAVE_STEPS=400
SEED=1
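(As a side note, the `*_stripped.txt` input files can presumably be regenerated from the diacritized target files. A minimal sketch, assuming the stripping is plain Unicode decomposition with combining marks removed; the helper below is hypothetical and not part of the repository:)

```python
# Hypothetical helper (not from the repo): derive an undiacritized input
# line from a diacritized target line via Unicode NFD decomposition.
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits each accented character into base char + combining mark;
    # category "Mn" (nonspacing mark) filters the marks out.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("žluťoučký kůň"))  # prints: zlutoucky kun
```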

(not necessary, but still required by the argument parser)

INPUT_TRAIN_FILE="data/target_train_small_stripped.txt"
TARGET_TRAIN_FILE="data/target_train_small.txt"
INPUT_DEV_FILE="data/target_train_small_stripped.txt"
TARGET_DEV_FILE="data/target_train_small.txt"
INPUT_TEST_FILE="data/target_train_small_stripped.txt"
TARGET_TEST_FILE="data/target_train_small.txt"

TOKENIZER_NAME=bert-base-multilingual-uncased
MAX_LENGTH=128
BERT_MODEL=bert-base-multilingual-uncased

python run_diacritization.py \
  --data_dir $DATA_DIR \
  --model_name_or_path $MODEL \
  --tokenizer_name $TOKENIZER_NAME \
  --output_dir $MODEL \
  --per_device_eval_batch_size 1 \
  --max_seq_length $MAX_LENGTH \
  --cache_dir $DATA_DIR \
  --do_predict \
  --prediction_file_path $IN_FILE \
  --input_train_file "$INPUT_TRAIN_FILE" \
  --target_train_file "$TARGET_TRAIN_FILE" \
  --input_dev_file "$INPUT_DEV_FILE" \
  --target_dev_file "$TARGET_DEV_FILE" \
  --input_test_file "$INPUT_TEST_FILE" \
  --target_test_file "$TARGET_TEST_FILE" \
  > $OUT_FILE

The output was:

b'roku 1522 oblehal osmansky sultan sulejman i .'
b'rhodos ( k dispozici mel 400 lodi a 200 tisic vojaku , zatimco rad disponoval jen 7 tisici rytiri ) .'
b'po sestimesicnim oblezeni byl rytirum povolen odchod z ostrova .'
b'roku 1530 jim cisar svate rise rimske a spanelsky kral karel v. a papez klement viii . udelili v leno souostrovi malty za symbolickou povinnost odvadet jednoho maltskeho sokola rocne .'
b'k vyrazne dulezitejsi roli radu vsak patril boj proti muslimskym piratum , kteri ze svych malych pristavu na severoafrickem pobrezi znepokojovali velky pocet mest a osad na evropskem pobrezi stredozemniho more .'
b'k nejznamejsim korzarum patril napr . chajruddin barbarossa ( zvany tez khair ad-din ) a jeho bratr oruc , pozdeji pirat dragut , kteri stali ve sluzbach osmanskych sultanu .'
b'roku 1551 prepadli pirati ostrov gozo na malte a unesli do otroctvi 5000 az 6000 obyvatel ostrova ( temer veskere obyvatelstvo ) , 1561 zajali katanskeho biskupa niccolu carraciolu , ktery cestoval na galere maltezskeho radu .'
b've velke a krvave bitve zde rad po dlouhem oblezeni 1565 zvitezil nad osmanskymi vojsky .'
b'zalozil hlavni mesto a pevnost vallettu a opevnil cely ostrov ( pod vedenim tehdejsiho velmistra jean parisot de la valette ) .'
b'v nove zrizene nemocnici byli osetrovani pacienti z cele stredomorske oblasti .'
b'johanite vybudovali spravni aparat .'
b'pod jejich vedenim se malta stala suverennim statem .'
b'roku 1798 musel rad kapitulovat pred napoleonem bonaparte , tahnoucim se svou armadou na sve egyptske tazeni ( ridice se svym heslem \xe2\x80\x9enepozvedni mece proti bratru krestanovi \xe2\x80\x9c ) , a usadil se na nekolik desetileti v rusku , pricemz ztratil behem dalsich desetileti vetsinu svych prevorstvi ( tzn . pozemku a jmeni ) a tim i na vyznamu .'
b'podle mezinarodniho prava je svrchovany rad maltezskych rytiru suverennim nestatnim subjektem bez vlastniho uzemi ( mezi znalci mezinarodniho prava neni tento nazor zcela jednoznacny ) .'
....
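(Incidentally, the `b'...'` prefixes and `\xe2\x80\x9e` escapes show that the script is printing raw Python `bytes` reprs rather than decoded strings; decoding as UTF-8 recovers readable text. A minimal sketch using a fragment of one printed line:)

```python
# A fragment of one printed line from the output above: it is a bytes repr,
# so \xe2\x80\x9e and \xe2\x80\x9c are the UTF-8 encodings of the
# quotation marks „ and “.
line = b'ridice se svym heslem \xe2\x80\x9enepozvedni mece proti bratru krestanovi \xe2\x80\x9c'
print(line.decode("utf-8"))
# prints: ridice se svym heslem „nepozvedni mece proti bratru krestanovi “
```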

What exactly is going wrong here? The predicted sentences come out with no diacritics restored at all. I also trained on a much bigger train file with more epochs, but the issue is still the same.

Thanks in advance.

jbellic commented 3 years ago

Any updates on this? I followed all the steps, but the result is not reproducible. There is also an issue with the transformers library version: the `prediction_step` function is only available in newer releases, so the code does not match the library version pinned in the requirements file.
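(For context, `Trainer.prediction_step` only exists from a certain transformers release onward. A minimal, hypothetical version guard one could add before running prediction; the `3.4.0` minimum below is an assumption for illustration, so check the transformers changelog for the actual release:)

```python
# Hypothetical guard (not part of the repository): fail fast when the
# installed transformers release predates Trainer.prediction_step.
def version_tuple(version: str) -> tuple:
    # "4.2.1.dev0" -> (4, 2, 1): keep only the leading numeric components
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)

def supports_prediction_step(installed: str, minimum: str = "3.4.0") -> bool:
    # Assumed minimum; verify against the transformers release notes.
    return version_tuple(installed) >= version_tuple(minimum)

print(supports_prediction_step("2.11.0"))  # False: too old
print(supports_prediction_step("4.5.1"))   # True
```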

arahusky commented 3 years ago

I am sorry, but I am quite busy these days. I hope I will have a look at it this week.

arahusky commented 3 years ago

I hope I have fixed the issue in the recent commit (https://github.com/ufal/bert-diacritics-restoration/commit/bf636ca2f49b46df60995b100bce259a74ff5488). The issue was introduced when refactoring the old code into the final version for publication -- sorry for that, and thank you for reporting! :)

Regarding the Transformers version -- which one does work for you?

jbellic commented 3 years ago

Works perfectly fine, even with the latest transformers version after doing some minor refactoring. Thank you very much.