huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Multilingual Issue #49

Closed hahmyg closed 5 years ago

hahmyg commented 5 years ago

Dear authors, I have two questions.

First, how can I use the multilingual pre-trained BERT in PyTorch? Should I just download the model to $BERT_BASE_DIR?

Second is a tokenization issue. For Chinese and Japanese the tokenizer seems to work, but for Korean it gives a different result than I expected:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# English-only uncased vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "안녕하세요"  # "Hello" in Korean
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']

The result is split not into characters but into individual jamo (the decomposed sub-syllable letters); maybe it comes from a Unicode issue. (I expected ['안녕', '##하세요'].)
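My guess is that the jamo split comes from the uncased tokenizer's accent-stripping step, which applies NFD Unicode normalization; NFD decomposes each precomposed Hangul syllable into its constituent jamo. A minimal sketch reproducing the same split with only the standard library:

import unicodedata

text = "안녕하세요"
# NFD splits each precomposed Hangul syllable (e.g. '안') into its
# leading consonant, vowel, and (optional) trailing consonant jamo.
decomposed = unicodedata.normalize("NFD", text)
print(list(decomposed))
# ['ᄋ', 'ᅡ', 'ᆫ', 'ᄂ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄉ', 'ᅦ', 'ᄋ', 'ᅭ']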

thomwolf commented 5 years ago

Hi, you can use the multilingual model as indicated in the readme with the commands:

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')

This will load the multilingual vocabulary (which should contain Korean), which your command was not loading.
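A minimal end-to-end sketch of the suggested fix (the exact WordPiece split depends on the multilingual vocabulary, so the printed pieces are indicative only):

from pytorch_pretrained_bert import BertTokenizer, BertModel

# Load the multilingual checkpoint instead of the English-only
# 'bert-base-uncased' one, so the vocabulary covers Korean.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')

tokens = tokenizer.tokenize("안녕하세요")
print(tokens)  # syllable-level WordPiece pieces rather than single jamo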