grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

Help with script inputs, training, predicting and evaluation #138

Closed eedenong closed 2 years ago

eedenong commented 2 years ago

Hi! I have been running into the issue that after following the steps to train and predict as highlighted in the README, the evaluation scores are quite poor. I am not sure if it is due to misformatting the training data, so I would like to seek some help here! Here are the steps that I took to carry out the training, prediction, and evaluation:

All of this was done on Google Colab.

**Data preprocessing**

I used the FCE dataset to generate the train and dev sets, specifically:

  1. Use the error.py script from the PIE repo (https://github.com/awasthiabhijeet/PIE/tree/master/errorify) to generate the parallel text files correct.txt and incorrect.txt.

  2. Use preprocess_data.py from the GECToR repo on those parallel files to generate the output files (train.txt and dev.txt respectively).
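For reference, my understanding of the preprocessed output is roughly one edit tag per source token. This is an illustrative sketch only, not an official spec; the `SEPL|||SEPR` delimiter and the tag names below are assumptions based on what the GECToR output files look like:

```python
# Illustrative sketch only -- the real alignment logic lives in
# utils/preprocess_data.py. Assumed format: each source token is paired
# with an edit tag via the "SEPL|||SEPR" delimiter (e.g. $KEEP, $DELETE,
# $REPLACE_x, $APPEND_x), with a leading $START token.
def toy_preprocessed_line(tokens_with_tags):
    return " ".join(f"{tok}SEPL|||SEPR{tag}" for tok, tag in tokens_with_tags)

line = toy_preprocessed_line([
    ("$START", "$KEEP"),
    ("She", "$KEEP"),
    ("go", "$REPLACE_goes"),   # source "go" -> target "goes"
    ("home", "$KEEP"),
])
```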

**Model training**

Then, I trained the model using the generated train.txt and dev.txt. My Google Colab runtime timed out partway through training.

**Prediction and evaluation**

Afterwards, I ran the prediction script on train_incorr_sentences.txt from the PIE repository (https://github.com/awasthiabhijeet/PIE/tree/master/scratch) to obtain the predictions as preds_output.m2. The model path pointed to the best.th file in the model_output folder.

Then, I used the two parallel text files from the same PIE folder (train_incorr_sentences.txt and train_corr_sentences.txt) to generate a reference file ref_output.m2.

Then, I ran the m2scorer script with the SYSTEM argument set to preds_output.m2 and SOURCE_GOLD set to ref_output.m2.

These were the resulting scores (after training once, i.e. stage 1):

Precision: 0.0831
Recall: 0.0780
F0.5: 0.0820
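As a quick sanity check, the reported F0.5 is consistent with the precision and recall above (F_beta with beta = 0.5 weights precision more heavily than recall):

```python
# Recompute F0.5 from precision and recall to sanity-check the
# m2scorer output: F_beta = (1 + b^2) * P * R / (b^2 * P + R).
def f_beta(p, r, beta=0.5):
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

score = f_beta(0.0831, 0.0780)  # close to the reported 0.0820
```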

I am not sure whether I am using the wrong datasets or passing them to the wrong scripts, since there isn't much documentation on exactly which files, and in what format, each step expects. It would be a big help if someone could point me in the right direction on what data I should use for each step and whether I am processing it correctly!

I also read that you did 3 stages of training; are these low scores expected after only the first stage?

skurzhanskyi commented 2 years ago

Hi @eedenong

Please take a look at the corresponding README sections if you want to reproduce the results in the paper.

  1. The data for the first stage can be found here, as mentioned in the Dataset section.
  2. It looks like you're using the default training parameters. We explained our parameters for each stage in detail here.
  3. From what I see, you used preprocess_data.py correctly (errorful data as the source and error-free data as the target). You may also want to look at similar issues (#136, #104, #53).
  4. You can find our scores after each stage in Table 4 of the paper.

eedenong commented 2 years ago

Thank you, I will take a look at them!

Just to clarify: for the model inference input file, should it be in M2 format or plain txt format, and should it be a dataset of incorrect sentences to be corrected? In that case, would it suffice to simply use a dataset of incorrect sentences, for example a1_train_incorr_sentences.txt from the PIE synthetic dataset? So far, the issues I have seen only discuss the formats of the text files for preprocessing.

skurzhanskyi commented 2 years ago

If you're talking about predict.py, it takes incorrect sentences as the model input. For the prediction stage, the model shouldn't require correct output as part of the input. Thus the M2 and preprocess_data.py formats don't fit here.
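In other words, the file passed to --input_file is just plain text with one (possibly errorful) sentence per line. A minimal sketch of preparing such a file (the example sentences and the file name are made up for illustration):

```python
# Prepare a plain-text input for predict.py: one incorrect sentence per
# line, with no M2 annotations and no parallel correct side.
sentences = [
    "She go to school every day .",
    "I has an apple .",
]
with open("input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")

# Reading it back shows the expected shape: plain sentences, no tags.
with open("input.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]
```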

eedenong commented 2 years ago

> For the prediction stage, the model should require correct output as part of the input.

Regarding this, which input are you referring to? Do you mean the --output_file argument passed to predict.py, or that the correct output should be in the same text file as the one passed to --input_file for predict.py?

skurzhanskyi commented 2 years ago

Oh, sorry. I meant: "For the prediction stage, the model shouldn't require correct output as part of the input."

eedenong commented 2 years ago

I see, thank you! I have another query:

> The data for the first stage could be found here, as it was mentioned in the Dataset section.

May I clarify: am I supposed to generate the 98/2 train/dev split from the single file generated by preprocess_data.py? Or am I supposed to find separate train and dev sets and preprocess each of them to generate train.txt and dev.txt?

skurzhanskyi commented 2 years ago

Either way, the results will be the same.
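So splitting the single preprocessed file 98/2 is fine. A hypothetical helper for that split (the function name, seed, and fraction are illustrative, not from the repo):

```python
import random

# Hypothetical helper: shuffle the preprocessed lines and carve off ~2%
# for dev, keeping the remaining ~98% for train.
def train_dev_split(lines, dev_frac=0.02, seed=42):
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_frac))
    return shuffled[n_dev:], shuffled[:n_dev]

train, dev = train_dev_split([f"line {i}" for i in range(1000)])
# -> 980 train lines, 20 dev lines
```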