Closed hahmyg closed 5 years ago
Hi, you can use the multilingual model as indicated in the README with the commands:

```python
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')
```

This will load the multilingual vocabulary (which should contain Korean) that your command was not loading.
Dear authors, I have two questions.
First, how can I use the multilingual pre-trained BERT in PyTorch? Should I just download the model to $BERT_BASE_DIR?
Second is a tokenization issue. For Chinese and Japanese the tokenizer seems to work, but for Korean it shows a different result than I expected:

```
['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']
```

The result is based not on full characters but on their decomposed components ('byte-based characters'); it may come from a Unicode issue. (I expected ['안녕', '##하세요'].)
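For what it's worth, the output above looks like Unicode NFD decomposition: each precomposed Hangul syllable is split into its constituent Jamo letters before WordPiece runs, which would explain the twelve single-Jamo tokens. A minimal sketch of that suspected behavior using only Python's standard `unicodedata` module (this illustrates the normalization effect, not the tokenizer's actual internals):

```python
import unicodedata

text = "안녕하세요"
# As typed, this is 5 precomposed Hangul syllable characters.
print(len(text))  # 5

# NFD normalization decomposes each syllable into its Jamo letters
# (e.g. 안 -> ᄋ + ᅡ + ᆫ), yielding 12 code points -- the same count
# as the 12 WordPiece tokens shown in the question.
decomposed = unicodedata.normalize("NFD", text)
print(len(decomposed))  # 12
```

If the multilingual vocabulary was built from NFD-normalized text, syllable-level tokens like '안녕' would simply not exist in it.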