grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

Training and predicting correct nothing on custom dataset #131

Closed ierezell closed 3 years ago

ierezell commented 3 years ago

I followed all the steps described in the README on a custom dataset.

Preprocessing correctly creates the output file, with each line looking like $STARTSEPL|||SEPR$KEEP JeSEPL|||SEPR$KEEP suisSEPL||| etc.

with

python utils/preprocess_data.py -s my_source_corpus_lines.txt -t  my_target_corpus_lines.txt -o output.txt
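
For reference, the separator in that output is the literal string SEPL|||SEPR between each token and its tag. Here is a minimal sketch of how such a line decomposes (an illustrative snippet, not a script from the repo; it assumes one tag per token):

```python
# Illustrative sketch: split one preprocessed line into (token, tag)
# pairs using the literal "SEPL|||SEPR" separator shown above.
SEP = "SEPL|||SEPR"

def parse_line(line):
    pairs = []
    for chunk in line.strip().split(" "):
        token, _, tag = chunk.partition(SEP)  # tag is '' if SEP is absent
        pairs.append((token, tag))
    return pairs

print(parse_line("$STARTSEPL|||SEPR$KEEP JeSEPL|||SEPR$KEEP"))
# -> [('$START', '$KEEP'), ('Je', '$KEEP')]
```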

Then training with (note that my_model_dir was empty at the start):

python train.py --train_set output.txt --dev_set output.txt --model_dir my_model_dir --tune_bert 1

and then testing with:

python predict.py --model_path ./my_model_dir/best.th --vocab_path ./my_model_dir/vocabulary --input_file my_source_corpus_lines.txt --output_file res.txt

And got "Produced overall corrections: 0". Both files are identical; nothing is corrected.
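
For reference, a quick way to confirm this (an illustrative snippet; it assumes one sentence per line in both files):

```python
# Count how many lines differ between the prediction input and output.
# Paths match the commands above; one sentence per line is assumed.
with open("my_source_corpus_lines.txt") as src, open("res.txt") as out:
    changed = sum(s.strip() != o.strip() for s, o in zip(src, out))
print(f"changed lines: {changed}")
```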

Thanks in advance for any help. Have a great day!

skurzhanskyi commented 3 years ago

Did you use the parameters from training_parameters.md? The majority of labels in the training data are usually $KEEP tags, so a badly trained model will predict only $KEEP, which corresponds to no corrections.
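
One way to check this (an illustrative snippet, not a repo script) is to count label frequencies in the preprocessed training file, assuming the SEPL|||SEPR separator shown above and one tag per token:

```python
# Count label frequencies in the preprocessed file to see how
# dominant $KEEP is relative to actual edit tags.
from collections import Counter

counts = Counter()
with open("output.txt") as f:
    for line in f:
        for chunk in line.split():
            counts[chunk.split("SEPL|||SEPR")[-1]] += 1

total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}\t{n}\t{n / total:.1%}")
```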

ierezell commented 3 years ago

Hi @skurzhanskyi,

I changed some parameters and trained with:

python train.py \
  --train_set ./data/corpus/output.txt \
  --dev_set ./data/corpus/output.txt \
  --tune_bert 1 \
  --skip_correct 1 \
  --skip_complex 0 \
  --max_len 50 \
  --batch_size 2 \
  --tag_strategy keep_one \
  --cold_steps_count 2 \
  --cold_lr 1e-3 \
  --lr 1e-5 \
  --predictor_dropout 0.0 \
  --lowercase_tokens 0 \
  --pieces_per_token 5 \
  --label_smoothing 0.0 \
  --model_dir ./models \
  --accumulation_size 4 \
  --n_epoch 2 \
  --updates_per_epoch 10000 \
  --tn_prob 0 \
  --tp_prob 1 \
  --transformer_model roberta \
  --special_tokens_fix 1

And predicted with

python predict.py --model_path ./models/best.th --vocab_path ./models/vocabulary --input_file ./data/corpus/source.txt --output_file ./data/results/res.txt --iteration_count 10 
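
For context, --iteration_count controls GECToR's iterative refinement: the tagger is applied repeatedly, feeding each corrected output back in. A conceptual sketch of that loop (correct_once is a hypothetical stand-in for one tag-and-apply pass, not the repo's API):

```python
# Conceptual sketch of iterative refinement: apply the tagger until
# the output stops changing or the iteration budget runs out.
def iterative_correct(sentence, correct_once, iteration_count=10):
    for _ in range(iteration_count):
        corrected = correct_once(sentence)  # one full tag-and-apply pass
        if corrected == sentence:           # converged: no more edits
            return corrected
        sentence = corrected
    return sentence
```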

I also tried

python predict.py --model_path ./models/best.th --vocab_path ./models/vocabulary --input_file ./data/corpus/source.txt --output_file ./data/results/res.txt --iteration_count 5  --additional_confidence 0.2  --min_error_probability 0.5 

But that yielded poorer results.
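
That is expected: per the GECToR paper, both flags push predictions toward $KEEP, so raising them trades recall for precision. A rough sketch of the intent (a simplified proxy for the idea; the actual logic in predict.py differs in its details):

```python
import numpy as np

# Simplified illustration of the two inference knobs: a confidence
# bias added to the $KEEP probability, and a sentence-level error
# threshold below which all edits are suppressed. Not the repo's
# exact implementation.
def apply_inference_tweaks(probs, keep_idx, additional_confidence,
                           min_error_probability):
    """probs: (num_tokens, num_labels) array of per-token label probs."""
    probs = probs.copy()
    probs[:, keep_idx] += additional_confidence      # bias toward $KEEP
    error_prob = (1.0 - probs[:, keep_idx]).max()    # crude error estimate
    if error_prob < min_error_probability:
        probs[:] = 0.0
        probs[:, keep_idx] = 1.0                     # keep everything
    return probs
```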

There are still some easy errors that the model isn't fixing. I should run a longer training with pre-training and so on (this was just a POC/sanity check), but at least it's better now. However, the model seems really sensitive to parameterization.

I guess we can close the issue, as it's now just a matter of finding the best parameters for my use case.

Thanks for your help. Have a great day!

alan-ai-learner commented 2 years ago

@ierezell have you solved this issue? Could you look into mine, #142? Thanks.