kanyun-inc / fairseq-gec

Source code for paper: Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data

How to make data/train_merge and data/valid? #27

soyoung97 closed this issue 4 years ago

soyoung97 commented 4 years ago

Hello, thank you for your great work. I am trying to replicate this work and build on it, using this copy-attention model as a baseline for GEC.

I am currently switching the dataset to Korean, and I already have the preprocessed data.

However, I am having trouble creating the input for the model. I think I need to produce the files in out/data_bin, which are [train/valid].src-tgt.[src/tgt].[bin/idx] and [train/valid].label.[src/tgt].txt.

By analyzing the code, I found that I can generate the label files and the binary files by running preprocess.sh. But I always get this error: FileNotFoundError: [Errno 2] No such file or directory: 'data/train_merge.src'. Analyzing the code further, I found that preprocess.sh needs "trainpref" and "validpref", which are set to 'data/train_merge' and 'data/valid', to generate the label and binary files. I couldn't find the code that generates these files, which means I have to create them myself.

My overall question is: how can I create the input for training the model?

  1. How can I make data/train_merge.src, data/train_merge.tgt, data/valid, and the alignment file (data/train_merge.forward)? What are their formats? (For example, dict.src.txt lists [word] [frequency] entries from the training data.)
  2. Are there any other files I need besides data/train_merge.[src/forward], data/valid, and dicts/dict.src.txt in order to run preprocess.sh and get all the required inputs?
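
For readers with the same question, here is a minimal Python sketch of one way to lay out such files. It is not taken from this repository. It assumes the usual fairseq convention of one whitespace-tokenized sentence per line, with line N of the .src file paired with line N of the .tgt file, and it assumes the ungrammatical sentence is the source and the corrected sentence is the target. The helper names (write_parallel, count_words, write_dict) are hypothetical, and the dictionary writer only mirrors the "[word] [frequency]" layout described in the question.

```python
from collections import Counter
from pathlib import Path


def write_parallel(pairs, prefix):
    """Write (source, target) sentence pairs to <prefix>.src and <prefix>.tgt.

    Assumption: one tokenized sentence per line, line-aligned across files.
    """
    prefix = Path(prefix)
    prefix.parent.mkdir(parents=True, exist_ok=True)
    with open(f"{prefix}.src", "w", encoding="utf-8") as src, \
         open(f"{prefix}.tgt", "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Normalize whitespace so tokens are separated by single spaces.
            src.write(" ".join(source.split()) + "\n")
            tgt.write(" ".join(target.split()) + "\n")


def count_words(pairs):
    """Count token frequencies over both sides of the sentence pairs."""
    counts = Counter()
    for source, target in pairs:
        counts.update(source.split())
        counts.update(target.split())
    return counts


def write_dict(counts, path):
    """Write "[word] [frequency]" lines, most frequent word first."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counts.most_common():
            f.write(f"{word} {freq}\n")
```

Calling write_parallel(pairs, "data/train_merge") and write_parallel(valid_pairs, "data/valid") would then produce the files that "trainpref" and "validpref" point to. Whether the model expects one shared dictionary or separate source and target dictionaries is not stated in this thread, so check preprocess.sh before relying on write_dict.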

Also, it would be very helpful if you could describe the general process for running the model with a different preprocessed dataset. Here, "preprocessed" means that I have done all of this (https://github.com/zhawe01/fairseq-gec/issues/14) to make training data, and I now have a clean dataset of [grammatically correct / grammatically incorrect] sentence pairs.

Thank you for reading my question. I will be waiting for your answer.

soyoung97 commented 4 years ago

After reading more about the fairseq implementation, I have figured out what to put in the data directory, but I still have questions about the alignment file. Can you tell me how you made the alignment file and what its format is?

zhawe01 commented 4 years ago

I use the "scripts/build_sym_alignment.py" script. Each line of the alignment file looks like this: 0-0 2-1 3-2 4-3 5-4 6-5 7-6 8-7 9-8 10-9 11-10 12-11
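
To make the format concrete, here is a small Python sketch that parses one such line. It assumes each item is a pair of zero-based token positions written as "i-j", with the left number indexing one sentence and the right number indexing the other. Which side is the source is not stated in this thread, so the code uses the neutral names "left" and "right". Both function names are hypothetical.

```python
def parse_alignment(line):
    """Parse a line such as "0-0 2-1 3-2" into a list of (left, right) index pairs."""
    pairs = []
    for item in line.split():
        left, right = item.split("-")
        pairs.append((int(left), int(right)))
    return pairs


def unaligned_left(pairs, left_length):
    """Return left-side token positions that do not appear in any pair."""
    covered = {left for left, _ in pairs}
    return [i for i in range(left_length) if i not in covered]
```

For the example line above, parse_alignment returns pairs starting with (0, 0), (2, 1), (3, 2). If the left-side sentence has 13 tokens, unaligned_left reports [1], because position 1 never appears on the left of any pair.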

soyoung97 commented 4 years ago

Thank you for your answer :) I will try to make the alignment file with build_sym_alignment.py.

HelenaHlz commented 4 years ago

soyoung97 wrote: "Thank you for your answer :) I will try to make the alignment file by build_sym_alignment.py"

Have you succeeded in making the alignment file with build_sym_alignment.py?

soyoung97 commented 4 years ago

I succeeded in producing output with mosesdecoder, but I think something is wrong with symal. It prints "symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)", but I still get the output files ($trainpref.backward and $trainpref.forward).
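
Whether that symal line is an error or only a status message is not settled in this thread. Either way, a quick consistency check on the generated file can show whether the output is usable. The Python sketch below is one way to do that under stated assumptions: the alignment file has one line per sentence pair, each item is "i-j", and the left index refers to the source tokens and the right index to the target tokens. The function name check_alignment is hypothetical.

```python
def check_alignment(src_lines, tgt_lines, align_lines):
    """Return a list of human-readable problems; an empty list means none were found.

    Assumption: in each "i-j" item, i indexes source tokens and j indexes target tokens.
    """
    problems = []
    if not (len(src_lines) == len(tgt_lines) == len(align_lines)):
        problems.append(
            f"line counts differ: src={len(src_lines)}, "
            f"tgt={len(tgt_lines)}, align={len(align_lines)}"
        )
        return problems
    rows = zip(src_lines, tgt_lines, align_lines)
    for number, (src, tgt, align) in enumerate(rows, start=1):
        src_len, tgt_len = len(src.split()), len(tgt.split())
        for item in align.split():
            i, j = (int(part) for part in item.split("-"))
            # An index at or past the sentence length cannot point to a real token.
            if i >= src_len or j >= tgt_len:
                problems.append(f"line {number}: {item} is out of range")
    return problems
```

To use it, read data/train_merge.src, data/train_merge.tgt, and data/train_merge.forward with str.splitlines() and pass the three lists in. If many lines are reported as out of range, the two index positions may be in the opposite order from the one assumed here.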