grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

Question about vocab generation #152

Closed shgabr closed 2 years ago

shgabr commented 2 years ago

We extracted the encoder part of the T5 model and trained it successfully, with pretty decent results. Training works when we don't supply a vocab path, which means the model generates its own vocab.

The problem is that the generated vocab is poor, and your vocab seems much better. We therefore tried to train our T5 encoder model using your vocab, but doing so results in this error:

ERROR:allennlp.data.vocabulary:Namespace: d_tags
ERROR:allennlp.data.vocabulary:Token: INCORRECT
Traceback (most recent call last):
  File "./modifiedGector/train.py", line 311, in <module>
    main(args)
  File "./modifiedGector/train.py", line 127, in main
    special_tokens_fix=args.special_tokens_fix)
  File "./modifiedGector/train.py", line 85, in get_model
    confidence=confidence)
  File "/home/cse/gector/modifiedGector/gector/seq2labels_model.py", line 76, in __init__
    namespace=detect_namespace)
  File "/home/cse/.conda/envs/gector/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

We suspect the error occurs because the training data was not tokenized specifically for the T5 model, but we don't know how to do that while using your vocabulary.

We tried removing the d_tags file, but got the same error.

Any help or advice would be highly appreciated.
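
Editor's note: the traceback above shows a lookup of INCORRECT in the d_tags namespace falling back to the OOV token @@UNKNOWN@@, which is also absent, so neither token was in the vocabulary that got loaded. A minimal sketch for checking a vocab directory before training follows. It assumes each namespace is stored as a plain-text file named <namespace>.txt with one token per line; the helper name missing_tokens and the directory layout are illustrative assumptions, not part of the GECToR or AllenNLP API.

```python
import os
import tempfile

# The OOV token name is taken from the KeyError in the traceback.
OOV_TOKEN = "@@UNKNOWN@@"


def missing_tokens(vocab_dir, namespace, required):
    """Return the required tokens absent from <vocab_dir>/<namespace>.txt.

    Hypothetical helper: assumes one token per line in a UTF-8 text file.
    If the file does not exist, every required token is reported missing.
    """
    path = os.path.join(vocab_dir, namespace + ".txt")
    if not os.path.isfile(path):
        return list(required)
    with open(path, encoding="utf-8") as f:
        present = {line.rstrip("\r\n") for line in f}
    return [tok for tok in required if tok not in present]


if __name__ == "__main__":
    # Build a deliberately incomplete d_tags file to show what the check reports.
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "d_tags.txt"), "w", encoding="utf-8") as f:
        f.write("CORRECT\n")
    print(missing_tokens(demo_dir, "d_tags", ["CORRECT", "INCORRECT", OOV_TOKEN]))
```

An empty result for d_tags would mean the file itself is fine, and the vocab path passed to train.py may not be the directory that was actually read.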

mina1460 commented 2 years ago

I tried it on my own and got the same result. Here is another issue that seems to describe the same problem: https://github.com/allenai/allennlp/issues/881#issuecomment-1016202077

shgabr commented 2 years ago

Problem solved

shiningliang commented 2 years ago

Hi @shgabr , I'm facing the same problem. How did you solve it? Thanks!

CoderBinGe commented 2 years ago

Hi @shgabr , I'm facing the same problem. How did you solve it? Thanks!

1783696285 commented 1 year ago

You need to use the "codecs" library when you create the labels in output_vocabulary.
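
Editor's note: the comment above does not include the code from its screenshot, so the following is only a guess at what it describes: writing the label file through Python's standard codecs module with an explicit UTF-8 encoding. The function names, the file name labels.txt, and the sample labels are illustrative assumptions.

```python
import codecs
import os
import tempfile


def write_labels(path, labels):
    """Write one label per line as UTF-8, using the standard codecs module."""
    with codecs.open(path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(label + "\n")


def read_labels(path):
    """Read the labels back, stripping the trailing line break from each line."""
    with codecs.open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\r\n") for line in f]


if __name__ == "__main__":
    # Illustrative labels only; a real vocabulary would hold the full tag set.
    out_dir = tempfile.mkdtemp()
    labels_path = os.path.join(out_dir, "labels.txt")
    write_labels(labels_path, ["$KEEP", "$DELETE", "$APPEND_café", "@@UNKNOWN@@"])
    print(read_labels(labels_path))
```

Fixing the encoding explicitly keeps non-ASCII labels byte-for-byte identical across platforms whose default encodings differ, which is one plausible reason a regenerated vocabulary might not match the one the model expects.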