stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

Why `handle_chinese_chars=False`? #18

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

Hi Stefan,

could you please quickly explain why you set handle_chinese_chars=False?

Thanks Philip

stefan-it commented 3 years ago

Hi @PhilipMay

I did not expect a huge number of characters in the training corpus to fall into this character range (copied from the tokenizers library):

https://github.com/huggingface/tokenizers/blob/371478027712de9895e5df6e2243b82af7cd614e/tokenizers/src/normalizers/bert.rs#L28-L49

/// Checks whether a character is chinese
/// This defines a "chinese character" as anything in the CJK Unicode block:
///   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
///
/// Note that the CJK Unicode block is NOT all Japanese and Korean characters,
/// despite its name. The modern Korean Hangul alphabet is a different block,
/// as is Japanese Hiragana and Katakana. Those alphabets are used to write
/// space-separated words, so they are not treated specially and handled
/// like for all of the other languages.
fn is_chinese_char(c: char) -> bool {
    match c as usize {
        0x4E00..=0x9FFF => true,
        0x3400..=0x4DBF => true,
        0x20000..=0x2A6DF => true,
        0x2A700..=0x2B73F => true,
        0x2B740..=0x2B81F => true,
        0x2B920..=0x2CEAF => true,
        0xF900..=0xFAFF => true,
        0x2F800..=0x2FA1F => true,
        _ => false,
    }
}
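To see what the option actually changes, here is a minimal Python sketch (not code from this repo) that mirrors the Rust check quoted above and the space-padding that handle_chinese_chars=True applies during normalization; the function names pad_cjk and is_chinese_char are illustrative, and the ranges are copied from the snippet:

```python
# Unicode ranges from the quoted tokenizers BERT normalizer.
CJK_RANGES = [
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F),
    (0x2B920, 0x2CEAF),
    (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F),
]

def is_chinese_char(c: str) -> bool:
    """True if c falls in the CJK Unified Ideographs blocks above."""
    cp = ord(c)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def pad_cjk(text: str) -> str:
    """Sketch of handle_chinese_chars=True: wrap each CJK character in
    spaces so the whitespace pre-tokenizer isolates it as its own token."""
    return "".join(f" {c} " if is_chinese_char(c) else c for c in text)

# Turkish text passes through untouched, so for a Turkish corpus the
# option has essentially no effect either way.
print(pad_cjk("Merhaba dünya"))  # unchanged
print(pad_cjk("中文"))            # each ideograph gets space-padded
```

With handle_chinese_chars=False this padding step is simply skipped, which is harmless for a corpus that contains virtually no characters in these blocks.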

That's why I disabled this option when creating the vocab :)

PhilipMay commented 3 years ago

Ok thanks. Closing this again.