jbellic closed this issue 3 years ago
Any updates on this? I followed all the steps, but the result is not reproducible. There is also an issue with the Transformers library version: the prediction_step function is only available in newer versions, so the code does not match the library versions pinned in the requirements file.
I am sorry, but I am quite busy these days. I hope I will have a look at it this week.
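For anyone hitting the same version mismatch, a quick way to see whether the installed library exposes the method is to probe for it at runtime. This is only a minimal sketch: the helper name `has_method` is made up for illustration, and the only thing taken from this thread is that the code relies on `prediction_step` of the Transformers `Trainer` class.

```python
import importlib


def has_method(module_name: str, class_name: str, method_name: str):
    """Return True/False if the class can be loaded, or None if it cannot.

    Hypothetical helper for illustration; not part of the repository.
    """
    try:
        module = importlib.import_module(module_name)
        cls = getattr(module, class_name)
    except Exception:
        # Module not installed, or the class is unavailable in this setup.
        return None
    return callable(getattr(cls, method_name, None))


if __name__ == "__main__":
    # None means transformers (or its Trainer class) could not be loaded here.
    print(has_method("transformers", "Trainer", "prediction_step"))
```

If this prints False, the installed version predates the method and the pinned requirements would need to be raised.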
I hope I have fixed the issue in the recent commit (https://github.com/ufal/bert-diacritics-restoration/commit/bf636ca2f49b46df60995b100bce259a74ff5488). The bug was introduced while refactoring the old code into the version to be published. Sorry for that, and thank you for reporting! :)
Regarding the Transformers version: which one works for you?
Works perfectly fine, even with the latest transformers version after doing some minor refactoring. Thank you very much.
Hi,
I've spent some time training on a small cs (Czech) dataset (downloaded from the given link) by following these steps:
(for the sake of time, all input/target train, dev, and test files point to the same files)
INPUT_TRAIN_FILE="data/target_train_small_stripped.txt"
TARGET_TRAIN_FILE="data/target_train_small.txt"
INPUT_DEV_FILE="data/target_train_small_stripped.txt"
TARGET_DEV_FILE="data/target_train_small.txt"
INPUT_TEST_FILE="data/target_train_small_stripped.txt"
TARGET_TEST_FILE="data/target_train_small.txt"
LABELS="data/subwords.txt"
OUTPUT_DIR="checkpoints"
DATA_DIR="cache"
NUM_EPOCHS=1
BERT_MODEL=bert-base-multilingual-uncased
TOKENIZER_NAME=bert-base-multilingual-uncased
MAX_LENGTH=128
BATCH_SIZE=64
GRAD_ACC_STEPS=32
SAVE_STEPS=400
SEED=1
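For context on the file pairs above: the `*_stripped.txt` input is the target text with diacritics removed. How the repository actually produces it is not shown in this thread, so the following is only a sketch of one common way to do it, using Unicode decomposition from the standard library. The function name `strip_diacritics` is an assumption for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, carons, rings) from text.

    Illustrative only; the repository may prepare its input differently.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("osmanský sultán Sulejman"))  # osmansky sultan Sulejman
```

A trained model is expected to map text like the stripped form back to the accented form, so output that looks like the stripped input means nothing was restored.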
(not necessary, but still needed by the arg parser)
INPUT_TRAIN_FILE="data/target_train_small_stripped.txt"
TARGET_TRAIN_FILE="data/target_train_small.txt"
INPUT_DEV_FILE="data/target_train_small_stripped.txt"
TARGET_DEV_FILE="data/target_train_small.txt"
INPUT_TEST_FILE="data/target_train_small_stripped.txt"
TARGET_TEST_FILE="data/target_train_small.txt"
TOKENIZER_NAME=bert-base-multilingual-uncased
MAX_LENGTH=128
BERT_MODEL=bert-base-multilingual-uncased
python run_diacritization.py \
    --data_dir $DATA_DIR \
    --model_name_or_path $MODEL \
    --tokenizer_name $TOKENIZER_NAME \
    --output_dir $MODEL \
    --per_device_eval_batch_size 1 \
    --max_seq_length $MAX_LENGTH \
    --cache_dir $DATA_DIR \
    --do_predict \
    --prediction_file_path $IN_FILE > $OUT_FILE \
    --input_train_file "$INPUT_TRAIN_FILE" \
    --target_train_file "$TARGET_TRAIN_FILE" \
    --input_dev_file "$INPUT_DEV_FILE" \
    --target_dev_file "$TARGET_DEV_FILE" \
    --input_test_file "$INPUT_TEST_FILE" \
    --target_test_file "$TARGET_TEST_FILE"
The output was:
b'roku 1522 oblehal osmansky sultan sulejman i .'
b'rhodos ( k dispozici mel 400 lodi a 200 tisic vojaku , zatimco rad disponoval jen 7 tisici rytiri ) .'
b'po sestimesicnim oblezeni byl rytirum povolen odchod z ostrova .'
b'roku 1530 jim cisar svate rise rimske a spanelsky kral karel v. a papez klement viii . udelili v leno souostrovi malty za symbolickou povinnost odvadet jednoho maltskeho sokola rocne .'
b'k vyrazne dulezitejsi roli radu vsak patril boj proti muslimskym piratum , kteri ze svych malych pristavu na severoafrickem pobrezi znepokojovali velky pocet mest a osad na evropskem pobrezi stredozemniho more .'
b'k nejznamejsim korzarum patril napr . chajruddin barbarossa ( zvany tez khair ad-din ) a jeho bratr oruc , pozdeji pirat dragut , kteri stali ve sluzbach osmanskych sultanu .'
b'roku 1551 prepadli pirati ostrov gozo na malte a unesli do otroctvi 5000 az 6000 obyvatel ostrova ( temer veskere obyvatelstvo ) , 1561 zajali katanskeho biskupa niccolu carraciolu , ktery cestoval na galere maltezskeho radu .'
b've velke a krvave bitve zde rad po dlouhem oblezeni 1565 zvitezil nad osmanskymi vojsky .'
b'zalozil hlavni mesto a pevnost vallettu a opevnil cely ostrov ( pod vedenim tehdejsiho velmistra jean parisot de la valette ) .'
b'v nove zrizene nemocnici byli osetrovani pacienti z cele stredomorske oblasti .'
b'johanite vybudovali spravni aparat .'
b'pod jejich vedenim se malta stala suverennim statem .'
b'roku 1798 musel rad kapitulovat pred napoleonem bonaparte , tahnoucim se svou armadou na sve egyptske tazeni ( ridice se svym heslem \xe2\x80\x9enepozvedni mece proti bratru krestanovi \xe2\x80\x9c ) , a usadil se na nekolik desetileti v rusku , pricemz ztratil behem dalsich desetileti vetsinu svych prevorstvi ( tzn . pozemku a jmeni ) a tim i na vyznamu .'
b'podle mezinarodniho prava je svrchovany rad maltezskych rytiru suverennim nestatnim subjektem bez vlastniho uzemi ( mezi znalci mezinarodniho prava neni tento nazor zcela jednoznacny ) .'
....
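The lines above are printed as Python bytes literals, so any restored character would show up as UTF-8 escapes such as `\xc5\x99` rather than as a visible accent. A rough way to check whether a prediction file contains any restored diacritics at all is sketched below. The helper names are hypothetical, and the sketch assumes every line has the `b'...'` form seen here.

```python
import ast
import unicodedata


def decode_bytes_repr(line: str) -> str:
    """Turn a printed bytes literal like b'...' back into text (hypothetical helper)."""
    value = ast.literal_eval(line.strip())
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def has_diacritics(text: str) -> bool:
    """True if any character carries a combining mark after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


if __name__ == "__main__":
    # One line copied from the output above, plus a made-up accented example.
    samples = [
        r"b'pod jejich vedenim se malta stala suverennim statem .'",
        r"b'\xc5\x99\xc3\xa1d'",
    ]
    for sample in samples:
        text = decode_bytes_repr(sample)
        print(has_diacritics(text), text)
```

If every line of the output reports False, the model returned the input unchanged, which matches what is shown above.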
What exactly is the issue? I also trained on a much larger training file with more epochs, but the issue is still the same.
Thanks in advance.