huggingface / swift-coreml-transformers

Swift Core ML 3 implementations of GPT-2, DistilGPT-2, BERT, and DistilBERT for Question answering. Other Transformers coming soon!
Apache License 2.0
1.61k stars 176 forks

Modify `BertTokenizer` #31

Closed fummicc1 closed 1 year ago

fummicc1 commented 1 year ago

While using `BertTokenizer`, I found two points that I think need to be changed.

Missing leading whitespace on each token

After the text is split on whitespace (` `), the subsequent components are missing their leading ` `.

Example input: Hello, I like dog.
Actual tokens: ["Hello", "I", "like", "dog"]
Expected tokens in my case: ["Hello", " I", " like", " dog"]
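To make the expected behavior concrete, here is a minimal sketch of a split that keeps the leading space on every piece after the first. The function name `splitKeepingLeadingSpace` is hypothetical and is not part of the repository's `BertTokenizer`. It only illustrates the whitespace handling described above and does not attempt punctuation or WordPiece handling.

```swift
/// Hypothetical helper, not the repository's API: splits on spaces and
/// re-attaches a single leading space to every piece after the first.
/// Assumption: runs of spaces are collapsed, since empty pieces are dropped.
func splitKeepingLeadingSpace(_ text: String) -> [String] {
    let pieces = text.split(separator: " ", omittingEmptySubsequences: true)
    return pieces.enumerated().map { index, piece in
        // The first piece has no preceding space in the input, so it stays bare.
        index == 0 ? String(piece) : " " + String(piece)
    }
}

// Usage with a punctuation-free input, to keep the sketch focused on whitespace.
let tokens = splitKeepingLeadingSpace("Hello I like dog")
print(tokens)
```

Whether a leading space is the right marker depends on how the vocabulary encodes word boundaries, so this should be read as an illustration of the request rather than a drop-in fix.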

Forced lowercasing

All input is converted to lowercase, but I think this behavior may not always be correct, since it depends on `vocab.json`.
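One way to avoid forcing lowercase is to make it an explicit option that the caller sets to match the vocabulary. The sketch below assumes a hypothetical `NormalizationOptions` type and `normalize` function. Neither name comes from the repository; they only show the idea of making lowercasing configurable.

```swift
/// Hypothetical options type; the name and field are illustrative only.
struct NormalizationOptions {
    /// Set to true for an uncased vocabulary, false for a cased one.
    var lowercase: Bool
}

/// Hypothetical normalization step: lowercases the text only when requested.
func normalize(_ text: String, options: NormalizationOptions) -> String {
    return options.lowercase ? text.lowercased() : text
}

// Usage: the same input under both settings.
let uncased = normalize("Hello, I like dog.", options: NormalizationOptions(lowercase: true))
let cased = normalize("Hello, I like dog.", options: NormalizationOptions(lowercase: false))
print(uncased)
print(cased)
```

With an option like this, a tokenizer could keep lowercasing for uncased vocabularies while leaving cased input untouched.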

I fixed the two issues above. Please correct me or close this PR if I am wrong.