google-research / uda

Unsupervised Data Augmentation (UDA)
https://arxiv.org/abs/1904.12848
Apache License 2.0

Couldn't get the same accuracy as BERT when using no augmentation in UDA #18

Open · hexiaoyupku opened this issue 5 years ago

hexiaoyupku commented 5 years ago

Hi, thank you for your wonderful work! I tried UDA without any augmentation on my text classification task, but I only get 93% accuracy, while BERT gets 96% accuracy with the same number of steps and the same learning rate. Are there any suggestions I could work on? Thanks!

hexiaoyupku commented 5 years ago

I fixed my problem by modifying the tokenizer. The tokenizer UDA uses is not consistent with the Chinese BERT pretrained model. Before:

def tokenize_to_wordpiece(self, tokens):
    # Runs WordPiece directly on each incoming token; for Chinese this skips
    # the per-character split that BERT's BasicTokenizer normally performs.
    split_tokens = []
    for token in tokens:
        split_tokens += self.wordpiece_tokenizer.tokenize(token)
    return split_tokens

After:

def tokenize(self, tokens):
    # Rejoin the tokens and re-tokenize the way BERT's FullTokenizer does:
    # BasicTokenizer first (which splits CJK text into single characters),
    # then WordPiece on each resulting token.
    text = ''.join(tokens)
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
        for sub_token in self.wordpiece_tokenizer.tokenize(token):
            split_tokens.append(sub_token)
    return split_tokens

Example: '东台市' is tokenized as ['东', '##台', '##市'] by UDA but as ['东', '台', '市'] by BERT. After modifying the tokenizer, I got the same accuracy as BERT.
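
In case it helps others reproduce the difference, here is a minimal sketch (assuming the tokenization.py module from the google-research/bert repo is importable and that vocab.txt comes from the Chinese BERT checkpoint; the vocab path below is hypothetical):

import tokenization

# Hypothetical path to the vocabulary of the Chinese BERT checkpoint.
vocab_file = 'chinese_L-12_H-768_A-12/vocab.txt'
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file)

text = u'东台市'

# FullTokenizer.tokenize runs BasicTokenizer first, which puts whitespace
# around every CJK character, so each character becomes its own token
# before WordPiece is applied.
print(tokenizer.tokenize(text))                       # expected: ['东', '台', '市']

# Feeding the raw string straight to the WordpieceTokenizer (what happens
# when the basic step is skipped) treats '东台市' as a single word and
# splits it into subword pieces instead.
print(tokenizer.wordpiece_tokenizer.tokenize(text))   # expected: ['东', '##台', '##市']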

1024er commented 5 years ago

> I fixed my problem by modifying the tokenizer. [...] After modifying the tokenizer, I got the same accuracy as BERT.

You can also try ERNIE on Chinese text classification tasks.

daisy-disc commented 4 years ago

> I fixed my problem by modifying the tokenizer. [...] After modifying the tokenizer, I got the same accuracy as BERT.

Where in the code should this modification be made?