hexiaoyupku opened this issue 5 years ago
I fixed my problem by modifying the tokenizer. The tokenizer UDA uses is not consistent with the BERT pretrained model for Chinese. Before:
def tokenize_to_wordpiece(self, tokens):
    split_tokens = []
    for token in tokens:
        split_tokens += self.wordpiece_tokenizer.tokenize(token)
    return split_tokens
After:
def tokenize(self, tokens):
    text = ''.join(tokens)
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
        for sub_token in self.wordpiece_tokenizer.tokenize(token):
            split_tokens.append(sub_token)
    return split_tokens
Example: '东台市' is tokenized as ['东', '##台', '##市'] in UDA but as ['东', '台', '市'] in BERT. After modifying the tokenizer, I got the same accuracy as BERT.
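For anyone who wants to reproduce the mismatch on their own vocab, here is a minimal sketch, assuming the tokenization module from the official BERT release and a downloaded Chinese checkpoint; the vocab path is a placeholder, not something from this repo:

# Minimal sketch: compare wordpiece-only tokenization (UDA's old path)
# with BERT's full basic + wordpiece tokenization on Chinese text.
# Assumes the official BERT `tokenization` module; adjust the vocab path.
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="chinese_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=True)

text = tokenization.convert_to_unicode(u"东台市")

# UDA-style path: wordpiece applied directly, without the basic tokenizer
# that first splits Chinese text into single characters.
print(tokenizer.wordpiece_tokenizer.tokenize(text))  # ['东', '##台', '##市']

# BERT-style path: basic tokenizer first, then wordpiece per character.
print(tokenizer.tokenize(text))                      # ['东', '台', '市']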
You can also try ERNIE on Chinese text classification tasks.
Where in the code should this change be made?
Hi, thank you for your wonderful work! I tried UDA without any augmentation on my text classification task, but I only get 93% accuracy, while BERT gets 96% accuracy with the same number of steps and learning rate. Are there any suggestions I could work on? Thanks!