huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

When using the bert-base-chinese model, consecutive identical uppercase English letters are collapsed to a single token (all but the first are dropped), and different uppercase English letters get the same input_id #14990

Closed longweiwei closed 2 years ago

longweiwei commented 2 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): bert-base-chinese (BertTokenizer)

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

```python
from transformers import BertTokenizer

# 1. Load the tokenizer for bert-base-chinese
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# 2. Encode single uppercase letters and a run of identical uppercase letters
encoded = tokenizer(["A", "B", "AAA"])
print(encoded["input_ids"])
```

result:

  1. The input_ids of 'A' and 'B' are both 100 (see the check after this list).
  2. After "AAA" is tokenized, only a single token is left.
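
A quick way to see what those ids mean is to map them back to tokens. The sketch below is a diagnostic guess, not a confirmed explanation: if id 100 turns out to be this tokenizer's unk_token_id, then 'A' and 'B' are simply out-of-vocabulary, and WordPiece maps the whole out-of-vocabulary word "AAA" to a single [UNK] rather than dropping letters.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Which id does this tokenizer use for unknown tokens?
print(tokenizer.unk_token, tokenizer.unk_token_id)

# Look at the tokens (not just the ids) produced for each input
for text in ["A", "B", "AAA"]:
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(text, tokens, ids)
```

If every printed token is [UNK], nothing is being silently discarded: each input word is unknown to the vocabulary, so it is represented by one [UNK] token, which is why 'A', 'B', and 'AAA' all produce the same single id.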

Expected behavior

  1. No English letters should be omitted after tokenization.
  2. The input_ids corresponding to different English letters should be different (a possible workaround is sketched after this list).
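
One workaround worth trying, assuming the behavior above comes from uppercase Latin letters being absent from the bert-base-chinese vocabulary: load the tokenizer with do_lower_case=True so 'A' and 'B' are folded to 'a' and 'b' before the WordPiece lookup. Whether this helps depends on the lowercase letters actually being in this checkpoint's vocabulary, so the sketch checks that first; it is a suggestion, not confirmed behavior of bert-base-chinese.

```python
from transformers import BertTokenizer

# Assumption: lowercasing before the WordPiece lookup avoids the [UNK] mapping,
# provided the lowercase letters exist in the vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese", do_lower_case=True)

# Check whether the lowercase letters are actually in the vocabulary
print("a" in tokenizer.vocab, "b" in tokenizer.vocab)

encoded = tokenizer(["A", "B", "AAA"])
print(encoded["input_ids"])
```

If distinct, non-[UNK] ids are needed for arbitrary English text, a checkpoint whose vocabulary covers English (for example a multilingual one such as bert-base-multilingual-cased) may be a better fit, since the bert-base-chinese vocabulary is built primarily around Chinese characters.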
longweiwei commented 2 years ago

hi! @vanpelt @pvl @arfon @xeb
can someone help me? Thank you all very much.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.