Open pedrojlazevedo opened 2 years ago
By default, the tokenizer is intended to behave as you described. You can check the implementation here.
If it does not match your use case, you can define a custom tokenizer and reverse_tokenizer and pass them to the augmenter. Here is some sample code.
import nlpaug.augmenter.char as nac

def custom_tokenizer(text):
    # Split the input string into a list of tokens.
    ...

def custom_reverse_tokenizer(tokens):
    # Join the list of tokens back into a single string.
    ...

aug = nac.KeyboardAug(tokenizer=custom_tokenizer, reverse_tokenizer=custom_reverse_tokenizer)
print(aug.augment("Hello . Test ? Testing ! And this : Not this + "))
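As one possible way to fill in the two placeholders, here is a minimal sketch of a whitespace-based pair. The names whitespace_tokenizer and whitespace_reverse_tokenizer are hypothetical and not part of nlpaug. It assumes the augmenter calls the tokenizer with a string and the reverse tokenizer with a list of tokens. Note that it collapses runs of whitespace into a single space and drops leading and trailing whitespace, so it keeps the space before punctuation but is not an exact round trip.

```python
def whitespace_tokenizer(text):
    # Hypothetical helper: split on any run of whitespace,
    # so punctuation separated by spaces stays its own token.
    return text.split()


def whitespace_reverse_tokenizer(tokens):
    # Hypothetical helper: join tokens with a single space and
    # do not touch the spacing around punctuation.
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "Hello . Test ? Testing ! And this : Not this + "
    tokens = whitespace_tokenizer(sample)
    # Round trip without any augmentation applied.
    print(whitespace_reverse_tokenizer(tokens))
```

These two functions can then be passed as the tokenizer and reverse_tokenizer arguments in the sample above. Whether the augmented output keeps the spacing also depends on how the augmenter handles short tokens such as "." and "?", so it is worth checking the result on your own data.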
Hello.
Reproduce:
This will output the following:
You can see that the whitespace is being removed. I tried passing a specific tokenizer (whitespace-based, the function below), without success.
Is this intentional? If not, what is the workaround?
__ Pedro Azevedo