huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

When using the bert-base-chinese model, consecutive identical uppercase English letters are collapsed to a single token (all but the first are dropped), and different uppercase English letters get the same input_id #14990

Closed longweiwei closed 2 years ago

longweiwei commented 2 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): bert-base-chinese (BertTokenizer)

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

```python
from transformers import BertTokenizer

# 1. Load the tokenizer for bert-base-chinese
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# 2. Encode single uppercase letters and a run of identical uppercase letters
encoded = tokenizer(["A", "B", "AAA"])
print(encoded["input_ids"])
```

result:

  1. The input_ids of 'A' and 'B' are both 100 (see the check after this list).
  2. After "AAA" is tokenized, only a single token is left.
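
A quick way to see what those ids mean is to map them back to tokens. The sketch below is a diagnostic guess, not a confirmed explanation: if id 100 turns out to be this tokenizer's unk_token_id, then 'A' and 'B' are simply out-of-vocabulary, and WordPiece maps the whole out-of-vocabulary word "AAA" to a single [UNK] rather than dropping letters.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Which id does this tokenizer use for unknown tokens?
print(tokenizer.unk_token, tokenizer.unk_token_id)

# Look at the tokens (not just the ids) produced for each input
for text in ["A", "B", "AAA"]:
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(text, tokens, ids)
```

If every printed token is [UNK], nothing is being silently discarded: each input word is unknown to the vocabulary, so it is represented by one [UNK] token, which is why 'A', 'B', and 'AAA' all produce the same single id.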

Expected behavior

  1. No English letters should be omitted after tokenization.
  2. The input_ids corresponding to different English letters should be different (a possible workaround is sketched after this list).
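
One workaround worth trying, assuming the behavior above comes from uppercase Latin letters being absent from the bert-base-chinese vocabulary: load the tokenizer with do_lower_case=True so 'A' and 'B' are folded to 'a' and 'b' before the WordPiece lookup. Whether this helps depends on the lowercase letters actually being in this checkpoint's vocabulary, so the sketch checks that first; it is a suggestion, not confirmed behavior of bert-base-chinese.

```python
from transformers import BertTokenizer

# Assumption: lowercasing before the WordPiece lookup avoids the [UNK] mapping,
# provided the lowercase letters exist in the vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese", do_lower_case=True)

# Check whether the lowercase letters are actually in the vocabulary
print("a" in tokenizer.vocab, "b" in tokenizer.vocab)

encoded = tokenizer(["A", "B", "AAA"])
print(encoded["input_ids"])
```

If distinct, non-[UNK] ids are needed for arbitrary English text, a checkpoint whose vocabulary covers English (for example a multilingual one such as bert-base-multilingual-cased) may be a better fit, since the bert-base-chinese vocabulary is built primarily around Chinese characters.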
longweiwei commented 2 years ago

hi! @vanpelt @pvl @arfon @xeb
can someone help me? Thank you all very much.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.