Hi @itchanghi, thanks for the feedback. Indeed, the run_squad example was not updated for cased models. I fixed that in commits c9fd3505678d581388fb44ba1d79ac41e8fb28a4 and 2e4db64cab198dc241e18221ef088908f2587c61.
Please re-open the issue if your problem is not fixed (and maybe summarize it in an updated version).
It seems that the default for do_lower_case is still True.
I'm working on fine-tuning the SQuAD task with the multilingual cased model.
Google says "When using a cased model, make sure to pass --do_lower=False to the training scripts. (Or pass do_lower_case=False directly to FullTokenizer if you're using your own script.)"
So I added the "do_lower_case" argument to the run_squad script. However, I got a weird token-conversion result like this: ['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '?'].
I think there are two problems in the run_squad.py example.
"--do_lower_case" 's default value is True also action value is 'store_true which means any case goes to args.do_lower_case value set True.
to be changed : default=True -> default=False
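For illustration, a minimal sketch of the argument definition with the suggested fix applied (the help text here is my own wording, not the script's):

```python
import argparse

parser = argparse.ArgumentParser()
# With default=True and action='store_true', the flag can never be turned off:
# omitting it keeps the default True, and passing it stores True again.
parser.add_argument("--do_lower_case",
                    default=False,  # was default=True in the example script
                    action="store_true",
                    help="Set this flag when using an uncased model.")
args = parser.parse_args()
```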
Second, even after changing that default, the tokenizer never sees the flag:
```python
# initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
```
In BertTokenizer's __init__ method, do_lower_case defaults to True, so when the classmethod from_pretrained is called with no extra arguments there is no way to change the do_lower_case value.
To be changed: BertTokenizer.from_pretrained(args.bert_model, do_lower_case=False)
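Or, as a sketch, forward the parsed command-line flag instead of hard-coding a value (assuming the flag is wired up as above):

```python
# Forward the command-line flag so the tokenizer no longer falls back to
# BertTokenizer's built-in default of do_lower_case=True.
tokenizer = BertTokenizer.from_pretrained(args.bert_model,
                                          do_lower_case=args.do_lower_case)
```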
It is probably not a big problem, but someone could be bitten by this issue. Many thanks for fixing it.
BTW, I still do not understand why I got [UNK] tokens for everything except English, punctuation, and numbers. The input text is Korean.
When reading the data, the only thing the do_lower_case flag controls is whether "token.lower()" and "_run_strip_accents(text)" are called.
When do_lower_case is False, the tokenizer works fine and I get the result I expect; in that case the tokens do not pass through the "token.lower()" and "_run_strip_accents(text)" methods.
Even if I set do_lower_case to True so that "token.lower()" and "_run_strip_accents(text)" are called, there should be no difference, because I debugged inside the _run_strip_accents method and the input string and the returned string were the same.
The input string is just split into characters and each one is checked for accent marks, but Korean has no accent characters, so joining the output list restores the input text exactly.
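For reference, here is a minimal standalone sketch of that accent-stripping step, paraphrasing BasicTokenizer._run_strip_accents from the library's tokenization code (not the exact source):

```python
import unicodedata

def run_strip_accents(text):
    # Decompose characters (NFD), then drop combining marks (category "Mn").
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(run_strip_accents("café"))   # -> "cafe" (combining accent removed)
print(run_strip_accents("한국어"))  # Hangul jamo are category "Lo", so nothing is dropped
```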
Any advice?