google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0
2.31k stars 351 forks source link

Add toggle to turn off `strip_accents`. #88

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

In some languages like German the accents are important and change the sementics. Examples:

  1. mochte vs. möchte
  2. musste vs. müsste
  3. etc.

But when doing lower_case they are automatically always stripped.

This PR adds a toggle to make it possible to do lower_case but keep the accents. This conforms to the transformers.tokenization_bert.BertTokenizerFast which also has an boolean parameter called strip_accents.

PhilipMay commented 3 years ago

@stefan-it Since you are also training language models for German language (and many others): could you please also have a look onto this PR and say what you think? Many thanks.

PhilipMay commented 3 years ago

Are there any concerns or questions about this PR? Is there any reason not to accept it? Thanks Philip

stefan-it commented 3 years ago

LGTM, I tested it with the input:

ÖÄÜ?
ßöäü

Together with --do-lower-case and --no-strip-accents the output from bert_tokens = self._tokenizer.tokenize(line) looks like:

['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü']

whereas the --do-strip-accents option outputs:

['o', '##au', '?']
['ß', '##oa', '##u']
PhilipMay commented 3 years ago

@stefan-it many thanks for your evaluation. ;-)

PhilipMay commented 3 years ago

Hi Electra Team. It would be awesome if you could merge this PR or give me a hint what you are still missing. The same change has been made my be to Hugging Face Transformers (and merged): https://github.com/huggingface/transformers/pull/6280

Many thanks Philip

PhilipMay commented 3 years ago

Dear maintainers. Could you please have a look on this PR? Thanks...

PhilipMay commented 3 years ago

Hey Google Research team. Could you please check this PR? @clarkkev

PhilipMay commented 3 years ago

Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it. Thanks Philip

lmthang commented 3 years ago

Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly.

PhilipMay commented 3 years ago

I added a comment above. I think the original behavior is preserved as it is now with

stefan-it commented 3 years ago

Original model is uncased, so yeah I think these default are correct :+1:

PhilipMay commented 3 years ago

Awesome - thanks for your review @stefan-it .

lmthang commented 3 years ago

Ah that's right. We stripped the accents by default.