Closed: IstiaqAnsari closed this issue 2 years ago
I don't think this question is an issue with the code itself, but I wasn't sure where else to ask it. Out of the 5000-label vocabulary for edit-operation prediction, 3802 labels replace an original word in a sentence that contains a grammar error. But these replacement tokens look very arbitrary, as if they were simply covering the general English vocabulary, for example: $REPLACE_electric, $REPLACE_sister, $APPEND_car, $REPLACE_fantastic, $REPLACE_examination, $APPEND_city, $REPLACE_eaten. These words seem random and have no grammatical significance. Why are they in the output space? And if words like these belong there, shouldn't every English word be in the output space? My guess is that the full English vocabulary is too large, so only the highest-frequency words were selected for this operation. Could that be the only reason?

Hello. These words were used to fix grammatical errors in the training data. Since we need a limited number of labels, not all words can be present here, only those with the highest frequency in the training data. In other words, these are the words in which people most often make errors (such as spelling mistakes).
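To make the frequency cutoff concrete, here is a minimal sketch of how such a capped tag vocabulary could be assembled from edits extracted from the training data. The function name, the `(operation, word)` tuple format, and the toy data are illustrative assumptions, not the repository's actual code:

```python
from collections import Counter

def build_edit_tag_vocab(edit_examples, max_tags=5000):
    """Build the edit-tag label space from (operation, word) pairs
    extracted by aligning erroneous source sentences with their
    corrected targets in the training data."""
    counts = Counter(f"${op}_{word}" for op, word in edit_examples)
    # Only the most frequent corrections survive the cutoff; every
    # rarer word is dropped and can never be predicted as a tag.
    return [tag for tag, _ in counts.most_common(max_tags)]

# Toy usage: frequent corrections make it in, rare ones are cut off.
edits = ([("REPLACE", "electric")] * 50
         + [("APPEND", "car")] * 30
         + [("REPLACE", "zyzzyva")])
print(build_edit_tag_vocab(edits, max_tags=2))
# -> ['$REPLACE_electric', '$APPEND_car']
```

Under this scheme, anything below the cutoff simply cannot be predicted as a correction, which is why the retained tags look like a slice of common English vocabulary rather than a grammatically curated set.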