makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License

Unable to augment Chinese, given corresponding language model. #231

Open beyondguo opened 3 years ago

beyondguo commented 3 years ago

Using the contextual word embedding augmenter (ContextualWordEmbsAug):

English:

import nlpaug.augmenter.word as naw

text = 'hi how are you'
context_aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")
augmented_text = context_aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

The output is fine:

Original:
hi how are you
Augmented Text:
hi how some you

But for Chinese:

text = '咋就不行了?'
context_aug = naw.ContextualWordEmbsAug(
    model_path='hfl/chinese-roberta-wwm-ext', action="substitute")
augmented_text = context_aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

The output is exactly the same as my input:

Original:
咋就不行了?
Augmented Text:
咋 就 不 行 了 ?

I have checked the source code but can't figure out what's going wrong. Could you please help me? Thanks a lot!

beyondguo commented 3 years ago

Update: I found that using bert-base-multilingual-uncased works fine:

text = '咋就不行了?'
context_aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-multilingual-uncased', action="substitute")
augmented_text = context_aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
Output:
Original:
咋就不行了?
Augmented Text:
咋 就 通 行 了 ?

So what's the problem with hfl/chinese-roberta-wwm-ext?

makcedward commented 3 years ago

Let's use "bert-base-multilingual-uncased" as a workaround for now. I will look into the "hfl/chinese-roberta-wwm-ext" model.

rajah commented 2 years ago

Even though hfl/chinese-roberta-wwm-ext and similar checkpoints are RoBERTa models, they have to be loaded as the BERT model type. Passing model_type='bert' as a parameter fixes the issue.
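
For reference, a minimal sketch of that fix (assuming an nlpaug version recent enough to expose the model_type parameter on ContextualWordEmbsAug):

import nlpaug.augmenter.word as naw

text = '咋就不行了?'
# Force nlpaug to load this RoBERTa checkpoint with the BERT
# model/tokenizer classes, as described above.
context_aug = naw.ContextualWordEmbsAug(
    model_path='hfl/chinese-roberta-wwm-ext',
    model_type='bert',
    action="substitute")
augmented_text = context_aug.augment(text)
print("Augmented Text:")
print(augmented_text)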