Closed PhilipMay closed 3 years ago
@stefan-it Since you are also training language models for German language (and many others): could you please also have a look onto this PR and say what you think? Many thanks.
Are there any concerns or questions about this PR? Is there any reason not to accept it? Thanks Philip
LGTM, I tested it with the input:
ÖÄÜ?
ßöäü
Together with --do-lower-case
and --no-strip-accents
the output from bert_tokens = self._tokenizer.tokenize(line)
looks like:
['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü']
whereas the --do-strip-accents
option outputs:
['o', '##au', '?']
['ß', '##oa', '##u']
@stefan-it many thanks for your evaluation. ;-)
Hi Electra Team. It would be awesome if you could merge this PR or give me a hint what you are still missing. The same change has been made my be to Hugging Face Transformers (and merged): https://github.com/huggingface/transformers/pull/6280
Many thanks Philip
Dear maintainers. Could you please have a look on this PR? Thanks...
Hey Google Research team. Could you please check this PR? @clarkkev
Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it. Thanks Philip
Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly.
I added a comment above. I think the original behavior is preserved as it is now with
do_lower_case=True
strip_accents=True
as defaults.Original model is uncased, so yeah I think these default are correct :+1:
Awesome - thanks for your review @stefan-it .
Ah that's right. We stripped the accents by default.
In some languages like German the accents are important and change the sementics. Examples:
But when doing
lower_case
they are automatically always stripped.This PR adds a toggle to make it possible to do
lower_case
but keep the accents. This conforms to thetransformers.tokenization_bert.BertTokenizerFast
which also has an boolean parameter calledstrip_accents
.