Closed fummicc1 closed 1 year ago
When I used `BertTokenizer`, I found two points that I think need to be changed.
Missing whitespace on each token: after splitting the text on whitespace (` `), every token after the first one loses its leading ` `.

Example input: `Hello, I like dog.`

Actual tokens: `["Hello", "I", "like", "dog"]`

Expected tokens in my case: `["Hello", " I", " like", " dog"]`
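To make the expected behavior concrete, here is a minimal sketch in Python. It is for illustration only and is not the code from this PR: `split_keep_space` is a hypothetical name, and punctuation handling is left out on purpose, so the sample input contains none.

```python
from typing import List


def split_keep_space(text: str) -> List[str]:
    """Split on whitespace, re-attaching one leading space to every token
    except the first. Hypothetical helper, not the tokenizer's real API."""
    words = text.split()  # splits on any run of whitespace, drops empty pieces
    return [word if index == 0 else " " + word for index, word in enumerate(words)]


tokens = split_keep_space("Hello I like dog")
print(tokens)
```

A plain `text.split()` would return the words without the spaces, which corresponds to the "actual" tokens above.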
Forced lowercasing: all input is converted to lowercase, but I think this is not always correct, because whether lowercasing is appropriate depends on `vocab.json`.
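One way to make lowercasing depend on the vocabulary could look like the sketch below. It is in Python for illustration and is not the change made in this PR. The function names are hypothetical, and the rule itself is an assumption: the vocabulary is treated as cased if any token other than a bracketed special token such as `[CLS]` contains an uppercase letter.

```python
from typing import Iterable


def vocab_looks_cased(vocab: Iterable[str]) -> bool:
    """Heuristic (assumption): a vocabulary is cased if any regular token
    contains an uppercase character. Bracketed special tokens are skipped."""
    for token in vocab:
        if token.startswith("[") and token.endswith("]"):
            continue  # e.g. [CLS], [SEP], [UNK] are uppercase even in uncased vocabularies
        if token != token.lower():
            return True
    return False


def normalize(text: str, vocab: Iterable[str]) -> str:
    """Lowercase the input only when the vocabulary appears to be uncased."""
    return text if vocab_looks_cased(vocab) else text.lower()


print(normalize("Hello", ["[CLS]", "hello"]))
print(normalize("Hello", ["[CLS]", "Hello", "hello"]))
```

An explicit configuration flag would be more reliable than inferring the behavior from the vocabulary; the heuristic is only meant to show the dependency on `vocab.json`.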
I fixed the two things above. Please correct me or close this PR if I am wrong.