Closed Arvinth-s closed 2 years ago
Hi @Arvinth-s
In order to build your own vocabulary, you should not specify the existing `vocab_path` for English.
https://github.com/grammarly/gector/blob/master/train.py#L114
> Hi @Arvinth-s In order to build your own vocabulary, you should not specify the existing `vocab_path` for English. https://github.com/grammarly/gector/blob/master/train.py#L114
I didn't specify the `vocab_path`. The model generated a new vocabulary, but it doesn't contain any Tamil words or characters.
The error was in the synthetic data generation part.
I trained the model on the Oscar Tamil dataset. I changed the `weights_name` in `train.py` and `predict.py` as discussed in #94. I tried the default RoBERTa model. I also tried XLM (`xlm-roberta-base`), which was pre-trained on a multilingual dataset that includes Tamil. I used errant for preprocessing the dataset. I trained the model for three epochs with 10000 `updates_per_epoch`. The generated vocabulary contains only English words and special characters, and no Tamil words or characters. Also, the model doesn't predict any corrections on the dataset.
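A quick way to confirm the symptom described above is to count how many entries in the generated label vocabulary contain characters from the Unicode Tamil block (U+0B80 to U+0BFF). This is a minimal sketch: the helper names `contains_tamil`, `tamil_label_stats`, and `tamil_labels_in_file` are hypothetical and not part of gector, and it assumes the file is UTF-8 text with one label per line.

```python
# Unicode range of the Tamil block.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def contains_tamil(text):
    """Return True if any character in `text` lies in the Tamil block."""
    return any(TAMIL_START <= ord(ch) <= TAMIL_END for ch in text)


def tamil_label_stats(lines):
    """Return (labels containing Tamil, total non-empty labels)."""
    labels = [line.strip() for line in lines if line.strip()]
    tamil = [label for label in labels if contains_tamil(label)]
    return len(tamil), len(labels)


def tamil_labels_in_file(path):
    """Same as tamil_label_stats, reading one label per line from `path`.

    Assumption: the file is UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as handle:
        return tamil_label_stats(handle)
```

For example, `tamil_labels_in_file("path/to/vocabulary/labels.txt")` (a placeholder path) returns a pair of counts. If the first number is 0, no Tamil text reached the vocabulary, which points at the preprocessing or data generation step rather than at training. The same check can be run on the preprocessed training file.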
Preprocessed data looks like this.
`labels.txt` in the vocabulary looks like this.