huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

bert-base-multilingual-cased, do lower case problem #131

Closed itchanghi closed 5 years ago

itchanghi commented 5 years ago

I'm working on fine-tuning squad task with multilingual-cased model.

Google says "When using a cased model, make sure to pass --do_lower=False to the training scripts. (Or pass do_lower_case=False directly to FullTokenizer if you're using your own script.)"

So I added the "do_lower_case" argument to the run squad script. However, I got a weird token conversion result like this: ['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '?'].

I think there are two problems in the run_squad.py example.

  1. default argument
    parser.add_argument("--do_lower_case",
                        default=True,
                        action='store_true',
                        help="Whether to lower case the input text. True for uncased models, False for cased models.")

"--do_lower_case" has default=True together with action='store_true', which means args.do_lower_case is set to True in every case, whether or not the flag is passed.

to be changed : default=True -> default=False
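The argparse behavior can be checked in isolation (a standalone sketch, not the actual script): with action='store_true', the parsed value is the default when the flag is absent, and True when the flag is present, so with default=True the value False is unreachable.

```python
import argparse

parser = argparse.ArgumentParser()
# Same pattern as in run_squad.py: default=True combined with store_true
parser.add_argument("--do_lower_case", default=True, action="store_true")

print(parser.parse_args([]).do_lower_case)                   # True (default)
print(parser.parse_args(["--do_lower_case"]).do_lower_case)  # True (flag stores True)
# There is no command-line input that yields False.
```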

Even with that change, the tokenizer never sees the new value:

  2. Tokenizer initialization: tokenizer = BertTokenizer.from_pretrained(args.bert_model). In the tokenizer's init method, do_lower_case defaults to True.

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
    
        Args:
          do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    That's why, when calling the classmethod from_pretrained with no additional arguments, there is no way to change the do_lower_case value.

    @classmethod
    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs):
        """
        ... (docstring and vocab resolution skipped) ...
        """
        # Instantiate tokenizer.
        tokenizer = cls(resolved_vocab_file, *inputs, **kwargs)
        return tokenizer

to be changed : BertTokenizer.from_pretrained(args.bert_model, do_lower_case=False)
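To see why the kwarg fix works, here is a minimal toy sketch (the ToyTokenizer class and the "vocab.txt" stand-in are made up, not the real library code) of how keyword arguments passed to from_pretrained are forwarded to __init__:

```python
# Toy sketch (hypothetical ToyTokenizer, not the real class) showing how
# **kwargs given to from_pretrained reach __init__.
class ToyTokenizer:
    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab_file = vocab_file
        self.do_lower_case = do_lower_case

    @classmethod
    def from_pretrained(cls, pretrained_model_name, *inputs, **kwargs):
        resolved_vocab_file = "vocab.txt"  # stand-in for the download/lookup step
        return cls(resolved_vocab_file, *inputs, **kwargs)

# Without the kwarg, do_lower_case silently stays at its default of True:
default_tok = ToyTokenizer.from_pretrained("bert-base-multilingual-cased")
# With the kwarg, the value is forwarded through **kwargs to __init__:
cased_tok = ToyTokenizer.from_pretrained("bert-base-multilingual-cased",
                                         do_lower_case=False)
```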

It may not be a problem for everyone, but someone could be bitten by this issue. Many thanks for fixing it.

By the way, I still do not understand why I get [UNK] tokens for everything except English letters, punctuation, and numbers.

The input text is Korean.

When reading data, the do_lower_case flag only decides whether "token.lower()" and "_run_strip_accents(text)" are called.

When do_lower_case is False, the tokenizer works fine and I get the result I expected; in that case the tokens do not go through the token.lower() and _run_strip_accents(text) calls.

Even if I set do_lower_case to True so that token.lower() and _run_strip_accents(text) are called, there should be no difference: I debugged inside _run_strip_accents, and the input string and the returned string were the same.

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

The input string is just decomposed and checked character by character for accent marks, but Korean doesn't have accent characters, so joining the output list completely restores the input text.
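One subtlety worth double-checking (my own standalone sketch, not something from the thread): unicodedata.normalize("NFD", ...) does decompose precomposed Hangul syllables into jamo, and since jamo carry category "Lo" rather than "Mn", the accent-stripping loop drops nothing yet still returns the decomposed string, which does not compare equal to the original:

```python
import unicodedata

s = "한국어"  # precomposed (NFC) Hangul
nfd = unicodedata.normalize("NFD", s)

# No character in the decomposition has category "Mn", so nothing is stripped,
# but the result is the jamo sequence, not the precomposed syllables.
kept = "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")
print(kept == s)                                 # False
print(unicodedata.normalize("NFC", kept) == s)   # True after re-composing
```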

Any advice?

thomwolf commented 5 years ago

Hi @itchanghi, thanks for the feedback. Indeed the run_squad example was not updated for cased models. I fixed that in commits c9fd3505678d581388fb44ba1d79ac41e8fb28a4 and 2e4db64cab198dc241e18221ef088908f2587c61.

Please re-open the issue if your problem is not fixed (and maybe summarize it in an updated version).

empty-id commented 4 years ago

It seems that default do_lower_case is still True.