jonathanbratt / RBERT

Implementation of BERT in R
Apache License 2.0

infer "casedness" of vocabulary from vocabulary #35

Closed jonathanbratt closed 3 years ago

jonathanbratt commented 4 years ago

Rather than making the user specify whether the vocabulary is cased, we should be able to infer this from the vocabulary itself with a very high degree of confidence. The place to do this is probably in `FullTokenizer` (so anything upstream of that would lose `do_lower_case` as a parameter). For example, we could add `do_lower_case <- !any(grepl("^[A-Z]", inv_vocab))` after the line `inv_vocab <- names(vocab)` in tokenization.R.

The above assumes that a vocabulary is cased iff it contains at least one token that begins with an uppercase letter. Anchoring the match at the start of the token ensures that we skip special tokens like [SEP] or [CLS], which begin with a bracket rather than a letter. Technically, somebody could perversely construct a vocab that is cased but in which no token starts with a capital. The above code would classify such a vocabulary as "uncased", though I believe the correct classification in that case (heh) would be "WTF".
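
To make the proposal concrete, here is a minimal sketch of the check as a standalone function. The name `infer_do_lower_case` and the two toy vocabularies are hypothetical, made up for illustration; the only pieces taken from the proposal are the regex and the assumption that `vocab` is a named vector whose names are the tokens.

```r
# Hypothetical helper, not part of RBERT's API: infer do_lower_case
# from the vocabulary instead of asking the user for it.
infer_do_lower_case <- function(vocab) {
  # Tokens are assumed to be stored as the names of the vocab vector.
  inv_vocab <- names(vocab)
  # The vocabulary counts as cased if any token starts with an uppercase
  # letter. Special tokens like "[CLS]" start with "[" and never match.
  is_cased <- any(grepl("^[A-Z]", inv_vocab))
  # Only lowercase the input when the vocabulary is uncased.
  !is_cased
}

# Toy vocabularies (made up for this example).
uncased_vocab <- c("[CLS]" = 0L, "[SEP]" = 1L, "the" = 2L, "##ing" = 3L)
cased_vocab <- c("[CLS]" = 0L, "[SEP]" = 1L, "The" = 2L, "the" = 3L)

infer_do_lower_case(uncased_vocab)
infer_do_lower_case(cased_vocab)
```

One caveat: R's regex documentation warns that character ranges such as `[A-Z]` can be interpreted differently depending on locale, so `^[[:upper:]]` might be a safer pattern if that turns out to matter.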

jonathanbratt commented 3 years ago

This is done in the wordpiece package.