makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License
4.46k stars · 463 forks

Augmenting a sentence is not persisting whitespaces between certain punctuation #272

Open · pedrojlazevedo opened 2 years ago

pedrojlazevedo commented 2 years ago

Hello.

Reproduce:

import nlpaug.augmenter.char as nac 

print(nac.KeyboardAug().augment("Hello . Test  ? Testing ! And this : Not this + "))

This will output the following:

output: "Hello. TeZ5? TesrKny! And tgiD: Not tyiq +"

You can see that the whitespace is being removed. I tried using a specific tokenizer (Whitespace), passing the function below, without success.

from tokenizers.pre_tokenizers import Whitespace

def whitespace_tokenizer(text, tokenizer=Whitespace()):
    # pre_tokenize_str returns a list of (token, (start, end)) pairs
    tokens_tuple = tokenizer.pre_tokenize_str(text)
    # keep only the token strings, dropping the offsets
    return list(map(list, zip(*tokens_tuple)))[0]
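For reference, `pre_tokenize_str` returns `(token, (start, end))` pairs, and the helper above keeps only the first column, so the offsets (and with them the original spacing) never reach the augmenter. A minimal illustration with a hand-written pair list (the values are illustrative, not real tokenizer output):

```python
# Shape of the pre-tokenizer output: (token, (start, end)) pairs.
tokens_tuple = [("Hello", (0, 5)), (".", (6, 7)), ("Test", (8, 12))]

# zip(*...) transposes the pairs; [0] keeps only the token column.
tokens = list(map(list, zip(*tokens_tuple)))[0]
print(tokens)  # ['Hello', '.', 'Test'] -- the offsets are gone
```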

Is this intentional? If not, what is the workaround?

Pedro Azevedo

makcedward commented 2 years ago

By default, the tokenizer is intended to behave as you described. You may check here for the implementation.

If that does not match your use case, you can pass a custom tokenizer and reverse_tokenizer to the augmenter. Here is the sample code.

import nlpaug.augmenter.char as nac

def custom_tokenizer(text):
    ...

def custom_reverse_tokenizer(tokens):
    ...

print(nac.KeyboardAug(tokenizer=custom_tokenizer, reverse_tokenizer=custom_reverse_tokenizer).augment("Hello . Test  ? Testing ! And this : Not this + "))
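One way to fill in those stubs is to keep each whitespace run as its own token, so the reverse step can restore the spacing exactly. This is a hedged sketch, assuming the augmenter calls the tokenizer as `tokenizer(text) -> list[str]` and the reverse tokenizer as `reverse_tokenizer(tokens) -> str`, as the sample above suggests:

```python
import re

def custom_tokenizer(text):
    # re.split with a capturing group keeps the separators in the result,
    # so whitespace runs survive as their own tokens.
    return [t for t in re.split(r"(\s+)", text) if t]

def custom_reverse_tokenizer(tokens):
    # Plain concatenation: the whitespace tokens restore the original spacing.
    return "".join(tokens)
```

One caveat: the augmenter may then also pick the whitespace tokens as augmentation targets; if that happens, excluding them (for example via the augmenter's stopwords parameter, or by checking `token.isspace()`) would be the next thing to try.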