
A PyTorch implementation of "Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study"
Apache License 2.0

# pytorch-human-performance-gec

The goal of this project is to implement a grammatical error correction model from the paper "Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study" using PyTorch and fairseq.

While the original paper reaches human-level performance, this implementation is primarily an empirical study of applying deep learning to NLP, carried out as a university team project.

## Project Team

This project was completed as the final project for CS 410: Text Information Systems at the University of Illinois at Urbana-Champaign. The team members and their primary areas of responsibility were:

A video presentation is available online and the slides are included in this repository. The project Technology Review is also available.

## What We Learned

## Empirical Study

### Completed

```
Iteration 0
O  In the world oil price very high right now .
H  In the world oil price very high right now .  0
Fluency Score: 0.1503

Iteration 1
O  In the world oil price very high right now .
H  In the world oil prices very high right now .  -0.2768438458442688
Fluency Score: 0.1539
Iteration 1
O  In the world oil price very high right now .
H  In the world oil prices are very high right now .  -0.31139659881591797
Fluency Score: 0.1831
Iteration 1
O  In the world oil price very high right now .
H  In the world oil price is very high right now .  -0.3594667315483093
Fluency Score: 0.1731
Iteration 1
O  In the world oil price very high right now .
H  In the world oil price very expensive right now .  -0.4148099422454834
Fluency Score: 0.1434
```

Best inference "In the world oil prices are very high right now ." (0.1831)

- Boost inference has been implemented using both the base model and the language model: an input sentence is corrected in several candidate ways, the best-scoring candidate is selected and fed into another round of correction, and this repeats until the score can no longer be improved. For example, given the input:

```
In the world oil price very high right now .
```

```
Iteration 0
O  In the world oil price very high right now .
H  In the world oil price very high right now .  0
Fluency Score: 0.1503

Iteration 1
O  In the world oil price very high right now .
H  In the world oil prices very high right now .  -0.2768438458442688
Fluency Score: 0.1539
Iteration 1
O  In the world oil price very high right now .
H  In the world oil prices are very high right now .  -0.31139659881591797
Fluency Score: 0.1831
Iteration 1
O  In the world oil price very high right now .
H  In the world oil price is very high right now .  -0.3594667315483093
Fluency Score: 0.1731
Iteration 1
O  In the world oil price very high right now .
H  In the world oil price very expensive right now .  -0.4148099422454834
Fluency Score: 0.1434
```

Boost inference from "In the world oil prices are very high right now ." (0.1831)

```
Iteration 2
O  In the world oil prices are very high right now .
H  In the world oil prices are very expensive right now .  -0.3672739863395691
Fluency Score: 0.1690
Iteration 2
O  In the world oil prices are very high right now .
H  In the world oil prices are very high now .  -0.4246770739555359
Fluency Score: 0.1883
Iteration 2
O  In the world oil prices are very high right now .
H  The world oil prices are very high right now .  -0.42579686641693115
Fluency Score: 0.1770
Iteration 2
O  In the world oil prices are very high right now .
H  In the world oil prices are very high right now ,  -0.6304754018783569
Fluency Score: 0.1748
```

Boost inference from "In the world oil prices are very high now ." (0.1883)

```
Iteration 3
O  In the world oil prices are very high now .
H  In the world oil prices are very expensive now .  -0.41596412658691406
Fluency Score: 0.1693
Iteration 3
O  In the world oil prices are very high now .
H  The world oil prices are very high now .  -0.45905303955078125
Fluency Score: 0.1780
Iteration 3
O  In the world oil prices are very high now .
H  In world oil prices are very high now .  -0.47978001832962036
Fluency Score: 0.1718
Iteration 3
O  In the world oil prices are very high now .
H  In the world oil prices are very high now ,  -0.6376678347587585
Fluency Score: 0.1780
```

Best inference "In the world oil prices are very high now ." (0.1883)

- Evaluation of the JFLEG test set using the GLEU score.
  - The base model achieves a GLEU score of 48.17 on the JFLEG test set when trained for 2 epochs.
  - The base model achieves a GLEU score of 48.89 when trained for 3 epochs.
  - The introduction of boost inference increases GLEU from 48.89 to 49.39; this ≈ 1% improvement is consistent with the paper.
- An enhanced interactive mode with a RESTful API and Web GUI.
  - RESTful API
  - ![RESTful API](raw/restful-api.png?raw=true "RESTful API")
  - Web GUI
  - ![Web GUI](raw/web-gui.png?raw=true "Web GUI")
  - Web GUI 2
  - ![Web GUI 2](raw/web-gui-2.png?raw=true "Web GUI 2")
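The boost-inference loop illustrated above can be sketched in a few lines of Python. `generate_candidates` and `fluency_score` are hypothetical stand-ins for the correction model's n-best output and the language-model fluency scorer (the paper defines fluency as 1/(1 + H(x)), with H(x) the average per-word cross-entropy); here they are hard-coded with values from the example:

```python
def generate_candidates(sentence):
    # Toy stand-in for the correction model's n-best hypotheses.
    rules = {
        "In the world oil price very high right now .":
            ["In the world oil prices are very high right now ."],
        "In the world oil prices are very high right now .":
            ["In the world oil prices are very high now ."],
    }
    return rules.get(sentence, [])

def fluency_score(sentence):
    # Toy stand-in for the 1/(1 + H(x)) fluency scorer.
    scores = {
        "In the world oil price very high right now .": 0.1503,
        "In the world oil prices are very high right now .": 0.1831,
        "In the world oil prices are very high now .": 0.1883,
    }
    return scores.get(sentence, 0.0)

def boost_inference(sentence, max_rounds=10):
    """Repeatedly re-correct, keeping the best-scoring hypothesis,
    until the fluency score stops improving."""
    best, best_score = sentence, fluency_score(sentence)
    for _ in range(max_rounds):
        candidates = generate_candidates(best)
        if not candidates:
            break
        top = max(candidates, key=fluency_score)
        top_score = fluency_score(top)
        if top_score <= best_score:
            break  # no improvement: stop boosting
        best, best_score = top, top_score
    return best, best_score
```

Run on the example input, the loop stops as soon as no candidate beats the current best score, mirroring the iteration trace above.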

### Not Completed

- A right-to-left convolutional seq2seq model.
- Training of the model using new dataset generated by boost learning.
- BPE tokenization and unknown token replacement.
- Stemming for raw text / interactive modes.
- Evaluation of the CoNLL-10 test set using the F0.5 measure.

### Barriers

- Not enough training data: we had only the Lang-8 corpus, which is about 20% of the data used by the paper. We tried to contact other organizations, but received no response.
- Limited time: training a large neural network model with large dictionaries is time-consuming. We decided to complete the whole process end to end rather than try to reproduce the scores achieved by the paper; given more time and data, those scores may be reachable.

## Initialize Submodules

After checking out the repository, be sure to initialize the included git submodules:

```sh
git submodule update --init --recursive
```

The reasons for using these as submodules rather than installed Python packages are:

## Install Required Dependencies

The environment used for development was Windows 10 64-bit + Python 3.6 + CUDA 9.2 + PyTorch 0.4.1.

PyTorch can be installed by following the directions on its project page. Conda is recommended as it will install CUDA dependencies automatically. For example,

```sh
conda install pytorch cuda92 -c pytorch
pip3 install torchvision
```

This project also uses the fairseq NLP toolkit, which is included as a submodule in this repository. To prepare the library for use, make sure that it is installed along with its dependencies.

```sh
cd fairseq
pip3 install -r requirements.txt
python setup.py build develop
```

Some preprocessing scripts also make use of the NLTK framework, which can be installed with this command:

```sh
pip3 install nltk
```

Once the NLTK framework is installed, the punkt dataset must also be downloaded. This can be done from the Python REPL:

```
python
>>> import nltk
>>> nltk.download('punkt')
```

Other project dependencies are placed under the fairseq-scripts folder and can be installed by running:

```sh
cd fairseq-scripts
pip3 install -r requirements.txt
```

## Folder Structure

```
.
|-- OpenNMT-py                  The other NLP toolkit we tried early (legacy)
|-- checkpoints                 Trained models and checkpoints
|   |-- errorgen-fairseq-cnn        An error generation model that takes corrected sentences as input,
|   |                               uncorrected sentences as output
|   |-- lang-8-fairseq              A simple single layer LSTM model for error correction
|   `-- lang-8-fairseq-cnn          A 7-layer convolutional seq2seq model for error correction
|-- corpus                      Raw and prepared corpus
|   |-- errorgen-fairseq            Corpus generated by the error generation model - the result of boost learning.
|   |-- lang-8-en-1.0               Raw Lang-8 corpus
|   |-- lang-8-fairseq              Corpus format required by fairseq
|   `-- lang-8-opennmt              Corpus format required by OpenNMT
|-- data-bin                    Pre-processed and binarized data
|   |-- errorgen-fairseq            Binarized synthetic data and dictionaries
|   |-- lang-8-fairseq              Binarized Lang-8 data and dictionaries
|   `-- wiki103                     Pre-trained WikiText-103 language model and dictionaries
|-- doc                         Additional project research and documentation
|-- fairseq                     fairseq submodule
|-- fairseq-scripts             fairseq scripts used to implement the model and process proposed by the paper
|-- opennmt                     OpenNMT data and model folder (legacy)
|-- opennmt-scripts             OpenNMT scripts folder (legacy)
`-- test                        Random test text files can be thrown to here
```

## fairseq Custom Scripts / Software Usage Tutorial

All fairseq scripts have been grouped under the fairseq-scripts folder. The whole process is:

  1. Prepare the data
  2. Pre-process the data
  3. Train the model
  4. Test the model
  5. Evaluate the model
  6. Run interactive mode
  7. Boosting

### Preparing Data

The first step is to prepare the source and target pairs of training and validation data. Extract the original lang-8-en-1.0.zip under the corpus folder, then create another folder lang-8-fairseq under corpus to store the re-formatted corpus.

To split the Lang-8 learner data training set, use the following command:

```sh
python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>
```

To split the CLC-FCE data set, use the following command:

```sh
python transform-CLC_FCE.py -src_dir <dataset-src> -out_dir <corpus-dir>
```

These scripts will create training, validation, and test sets for both the left-to-right and right-to-left models.
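The splitting these transform scripts perform can be sketched as below; the 98/1/1 ratio and the fixed seed are illustrative assumptions, not the scripts' actual settings.

```python
import random

def split_corpus(pairs, valid_frac=0.01, test_frac=0.01, seed=42):
    """Shuffle (source, target) sentence pairs and split them into
    training, validation, and test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle
    n_valid = int(len(pairs) * valid_frac)
    n_test = int(len(pairs) * test_frac)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test
```

Each split would then be written out as parallel source/target text files in the layout fairseq's pre-processing expects.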

### Pre-process Data

Once the data has been extracted from the dataset, use fairseq to prepare the training and validation data and create the vocabulary:

```sh
preprocess-lang8.bat
```

### Train the Model

To train the error-correcting model, run the following command:

```sh
train-lang8-cnn.bat
```

Note that this script may need to be adjusted based on the GPU and memory resources available for training.

### Testing the Model

To test the model, run the following command to try to correct a list of sentences:

```sh
generate-lang8-cnn.bat
```

This command will try to correct all sentences in a file, emitting probabilities and scores in the output. It is a convenient way to verify that the model behaves as expected against lots of test data.

### Evaluate the Model

The evaluation scripts score the model in batch using plain-text or pre-processed files.

Evaluation against the Lang-8 test set can be done using:

```sh
generate-lang8-cnn-rawtext.bat
```

The paper evaluates against the JFLEG test set, which can be done using:

```sh
generate-jfleg-cnn-rawtext.bat
```

The scripts above use plain text, so they can easily be modified to evaluate other test sets.

Other scripts such as generate-lang8.bat or generate-lang8-cnn.bat can only handle pre-processed data, which is less convenient.

### Interactive Mode

While the evaluation scripts are good at batch processing, two interactive scripts are provided to inspect the details of generation / correction.

The script below runs in console mode:

```sh
interactive-lang8-cnn-console.bat
```

The script below starts a local server that provides a web GUI and a RESTful API interface:

```sh
interactive-lang8-cnn-web.bat
```

Interactive mode lets users enter a sentence in the console or the Web GUI and see how subtle differences in the input are corrected.

### Boosting

To augment the training data with more examples of common errors, this project builds an error-generating model that produces additional lower-quality versions of correct sentences. It uses the same training data as the regular model, but with the source and target sentences reversed.
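Reversing the direction can be sketched as a one-liner over the parallel corpus; `to_errorgen_pairs` is an illustrative helper, not a function in this repository.

```python
def to_errorgen_pairs(correction_pairs):
    """Swap (errorful, corrected) pairs into (corrected, errorful) pairs,
    so a seq2seq model learns to introduce errors instead of fixing them."""
    return [(corrected, errorful) for errorful, corrected in correction_pairs]
```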

Once the data has been extracted from the dataset, use fairseq to prepare the training and validation data and create the vocabulary:

```sh
preprocess-errorgen.bat
```

To train the error-generating model, run the following command:

```sh
train-errorgen-cnn.bat
```

Note that this script may need to be adjusted based on the GPU and memory resources available for training.

Now the error-generating model can be used to generate additional training data. The generating script only considers sentences longer than four words that are at least 5% less fluent (as measured by the fluency scorer) than the corrected sentences. This ensures that the new sentences are more likely to showcase interesting corrections while avoiding trivial edits. Notice that in this case we use the latest model checkpoint rather than the best-generalizing one, because here overfitting to the training data is an advantage!

```sh
generate-errorgen-cnn.bat
```

The sentences generated in the corpus\errorgen directory can then be used as additional data to train or fine tune the error-correcting model.
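The filtering rule applied during generation can be sketched as follows; `keep_pair` and the `fluency` callback are illustrative names, assuming the scorer returns higher values for more fluent sentences.

```python
def keep_pair(errorful, corrected, fluency):
    """Keep a generated errorful sentence only if it is longer than four
    words and at least 5% less fluent than its corrected counterpart."""
    if len(errorful.split()) <= 4:
        return False  # too short to showcase an interesting correction
    return fluency(errorful) <= 0.95 * fluency(corrected)
```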

## Additional Techniques

In addition to the work described above, additional datasets and techniques for data preprocessing, model training, and other improvements were evaluated.

BPE tokenization promises to make more effective use of a limited vocabulary by subdividing words into subword tokens that can be shared by many different words. Additional documentation and a notebook showing how to install the tooling and tokenize the dataset are available.

An example of sentences after applying BPE tokenization can be seen below:

```
I will introduce my dog , Ti@@ ara .
She is a cheerful and plu@@ mp pretty dog , perhaps she is the cu@@ test dog in the world .
She 's an 8 year old golden re@@ tri@@ ever
Her fu@@ r is a beautiful a@@ mber colour and is soft .
When she has had her food , she always pr@@ ances around the living room mer@@ ri@@ ly .
And she loves ba@@ s@@ king too .
```
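The `@@` continuation markers above follow the subword-nmt convention: every subword unit except the last in a word is suffixed with `@@`. Below is a toy sketch of applying a merge list to a single word; real BPE applies merges by learned rank until none apply, and the merges here are made up for illustration.

```python
def bpe_segment(word, merges):
    """Split a word into characters, greedily apply each merge in order,
    and mark all non-final units with the '@@' continuation marker."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]
```

For example, with the made-up merges `[("T", "i"), ("a", "r"), ("ar", "a")]`, `bpe_segment("Tiara", ...)` yields `["Ti@@", "ara"]`, matching the `Ti@@ ara` segmentation above.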

## Patching the fairseq Environment

If the error `AttributeError: function 'bleu_zero_init' not found` occurs on Windows, add `__declspec(dllexport)` to the affected functions and build again. See Issue 292.

If a `UnicodeDecodeError: 'charmap' codec can't decode byte` error occurs, modify fairseq/tokenizer.py to pass `, encoding='utf8'` to all `open` calls.

When trying the built-in examples from fairseq/examples/translation/prepare-[dataset].sh, the scripts may need the .py path changed from `$BPEROOT/[script].py` to `$BPEROOT/subword_nmt/[script].py`.

## OpenNMT (Legacy)

Initial exploration and implementation of this project used the OpenNMT library. Documentation on how to use it is included below. A Framework Comparison explaining the two frameworks and why we settled on fairseq is also available.

### OpenNMT Scripts

All OpenNMT scripts have been grouped under the opennmt-scripts folder.

### Preparing Data

The first step is to prepare the source and target pairs of training and validation data. Extract the original lang-8-en-1.0.zip under the corpus folder, then create another folder lang-8-opennmt under corpus to store the re-formatted corpus.

To split the Lang-8 learner data training set, use the following command:

```sh
python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>
```

For example:

```sh
python transform-lang8.py -src_dir ../corpus/lang-8-en-1.0 -out_dir ../corpus/lang-8-opennmt
```

Once the data has been extracted from the dataset, use OpenNMT to prepare the training and validation data and create the vocabulary:

```sh
preprocess-lang8.bat
```

### Train the Model

To train the error-correcting model, run the following command:

```sh
train.bat
```

Note that this script may need to be adjusted based on the GPU and memory resources available for training.

### Testing the Model

To test the model, run the following command to try to correct a list of sentences:

```sh
translate.bat
```

After the sentences have been translated, the source and target sentences may be compared side by side using the following command:

```sh
python compare.py
```

### Patching the OpenNMT-py Environment

If preprocess.py fails with a `TypeError`, you may need to patch OpenNMT-py.

Update OpenNMT-py\onmt\inputters\dataset_base.py with the following code:

```python
def __reduce_ex__(self, proto):
    "This is a hack. Something is broken with torch pickle."
    return super(DatasetBase, self).__reduce_ex__(proto)
```

If `TypeError: __init__() got an unexpected keyword argument 'dtype'` occurs, the pytorch/text package installed by pip may be out of date. Update it using `pip3 install git+https://github.com/pytorch/text`.

If `RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS` occurs during training, try installing PyTorch with CUDA 9.2 using conda instead of the default CUDA 9.0.