Open eu9ene opened 3 months ago
There are also corpora (maybe it was UN) that do not have the ideographic full stop character (。) and use the ASCII full stop instead. That should be fixed.
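A rough sketch of what such a fix could look like, assuming one sentence per line on the Chinese side. The function name `normalize_final_stop` and the restriction to the basic CJK Unified Ideographs range are illustrative assumptions, not something taken from an existing script in this thread. It only touches an ASCII full stop that ends the line and directly follows a Chinese character, so decimals and Latin-script text are left alone.

```python
import re

IDEOGRAPHIC_STOP = "\u3002"  # 。

# ASCII full stop that directly follows a CJK ideograph and ends the line
# (optionally followed by trailing whitespace such as a newline).
# Assumption: only the basic CJK Unified Ideographs block is checked.
_FINAL_ASCII_STOP = re.compile(r"(?<=[\u4e00-\u9fff])\.(?=\s*$)")


def normalize_final_stop(line: str) -> str:
    """Replace a line-final ASCII full stop after a Chinese character with 。."""
    return _FINAL_ASCII_STOP.sub(IDEOGRAPHIC_STOP, line)
```

For example, `normalize_final_stop("这是一个句子.")` gives `"这是一个句子。"`, while a line ending in a digit or Latin letter before the full stop is returned unchanged.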
Is it appropriate to do data augmentation at training time to swap between . and 。?
If the model has to be robust, it is probably a good thing to do? But when Chinese is the target language, everything should be normalized to the ideographic punctuation.
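To make the augmentation idea concrete, here is a minimal sketch, assuming it is applied only to the Chinese source side and only to the sentence-final full stop. The function name `swap_final_stop` and the `prob` parameter are hypothetical, chosen for illustration; this is not an existing pipeline component.

```python
import random

ASCII_STOP = "."
IDEOGRAPHIC_STOP = "\u3002"  # 。


def swap_final_stop(sentence: str, rng: random.Random, prob: float = 0.5) -> str:
    """With probability `prob`, swap a sentence-final full stop to the other variant."""
    stripped = sentence.rstrip()
    # Leave empty lines alone, and skip the swap with probability 1 - prob.
    if not stripped or rng.random() >= prob:
        return sentence
    trailing = sentence[len(stripped):]  # preserve trailing whitespace/newline
    if stripped.endswith(IDEOGRAPHIC_STOP):
        return stripped[:-1] + ASCII_STOP + trailing
    if stripped.endswith(ASCII_STOP):
        return stripped[:-1] + IDEOGRAPHIC_STOP + trailing
    return sentence
```

With `prob=1.0` every sentence-final full stop is flipped, and with `prob=0.0` the input is returned unchanged. When Chinese is the target language, the swap would be skipped and the target normalized to 。 instead.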
Nikolay: The UN corpus, for example, doesn't contain full stops, and we use something like this to fix it:
(This script is integrated into the script that was copied and pasted previously.)