Rather than make the user specify whether the vocabulary is cased, we should be able to infer this from the vocabulary itself with a very high degree of confidence.
The place to do this is probably in FullTokenizer (so anything upstream of that would lose do_lower_case as a parameter).
For example, we could add:

```r
do_lower_case <- !any(grepl("^[A-Z]", inv_vocab))
```

after the line

```r
inv_vocab <- names(vocab)
```

in tokenization.R.
The above assumes that a vocabulary is cased iff it contains at least one token that begins with an uppercase letter. Anchoring the pattern at the start of the token ensures that we skip special tokens like [SEP] or [CLS]. Technically, somebody could perversely construct a vocab that is cased but in which no tokens start with capitals. The above code would classify such a vocabulary as "uncased", though I believe the correct classification in that case (heh) would be "WTF".
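To make the proposal concrete, here is a minimal sketch of the check against two toy vocabularies. The vocabulary contents and the helper name `infer_do_lower_case` are illustrative assumptions, not the actual tokenization.R source:

```r
# Toy vocabularies in the BERT format: named integer vector, token -> id.
# (These example vocabularies are assumptions for illustration only.)
cased_vocab   <- c("[CLS]" = 0L, "[SEP]" = 1L, "the" = 2L, "The" = 3L)
uncased_vocab <- c("[CLS]" = 0L, "[SEP]" = 1L, "the" = 2L)

# Hypothetical helper wrapping the proposed one-liner: a vocabulary is
# treated as cased iff some token begins with an uppercase letter.
# Anchoring the regex at "^" skips special tokens like [SEP] and [CLS].
infer_do_lower_case <- function(vocab) {
  inv_vocab <- names(vocab)
  !any(grepl("^[A-Z]", inv_vocab))
}

infer_do_lower_case(cased_vocab)    # FALSE: don't lowercase a cased vocab
infer_do_lower_case(uncased_vocab)  # TRUE: safe to lowercase input text
```

Inside FullTokenizer the same two lines would run once at construction time, so the user-facing `do_lower_case` parameter could be dropped entirely.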