makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License
4.46k stars · 463 forks

Augmenting a sentence is not persisting whitespaces between certain punctuation #272

Open · pedrojlazevedo opened 2 years ago

pedrojlazevedo commented 2 years ago

Hello.

Reproduce:

import nlpaug.augmenter.char as nac 

print(nac.KeyboardAug().augment("Hello . Test  ? Testing ! And this : Not this + "))

This will output the following:

output: "Hello. TeZ5? TesrKny! And tgiD: Not tyiq +"

You can see that the whitespace is being removed. I tried using a specific tokenizer (Whitespace), passing the function below, without success.

from tokenizers.pre_tokenizers import Whitespace

def whitespace_tokenizer(text, tokenizer=Whitespace()):
    # pre_tokenize_str returns a list of (token, (start, end)) pairs
    tokens_tuple = tokenizer.pre_tokenize_str(text)
    # keep only the token strings, dropping the offsets
    return list(map(list, zip(*tokens_tuple)))[0]
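For reference, `pre_tokenize_str` returns `(token, (start, end))` pairs, and the helper above keeps only the first column, so the offsets (and with them the original spacing) never reach the augmenter. A minimal illustration with a hand-written pair list (the values are illustrative, not real tokenizer output):

```python
# Shape of the pre-tokenizer output: (token, (start, end)) pairs.
tokens_tuple = [("Hello", (0, 5)), (".", (6, 7)), ("Test", (8, 12))]

# zip(*...) transposes the pairs; [0] keeps only the token column.
tokens = list(map(list, zip(*tokens_tuple)))[0]
print(tokens)  # ['Hello', '.', 'Test'] -- the offsets are gone
```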

Is this intentional? If not, what is the workaround?

Pedro Azevedo

makcedward commented 2 years ago

By default, the tokenizer is intended to behave as you described. You may check here for the implementation.

If that does not match your use case, you can pass a custom tokenizer and reverse_tokenizer to the augmenter. Here is the sample code.

import nlpaug.augmenter.char as nac

def custom_tokenizer(text):
    ...

def custom_reverse_tokenizer(tokens):
    ...

print(nac.KeyboardAug(tokenizer=custom_tokenizer, reverse_tokenizer=custom_reverse_tokenizer).augment("Hello . Test  ? Testing ! And this : Not this + "))
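One way to fill in those stubs is to keep each whitespace run as its own token, so the reverse step can restore the spacing exactly. This is a hedged sketch, assuming the augmenter calls the tokenizer as `tokenizer(text) -> list[str]` and the reverse tokenizer as `reverse_tokenizer(tokens) -> str`, as the sample above suggests:

```python
import re

def custom_tokenizer(text):
    # re.split with a capturing group keeps the separators in the result,
    # so whitespace runs survive as their own tokens.
    return [t for t in re.split(r"(\s+)", text) if t]

def custom_reverse_tokenizer(tokens):
    # Plain concatenation: the whitespace tokens restore the original spacing.
    return "".join(tokens)
```

One caveat: the augmenter may then also pick the whitespace tokens as augmentation targets; if that happens, excluding them (for example via the augmenter's stopwords parameter, or by checking `token.isspace()`) would be the next thing to try.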