Open matejklemen opened 4 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Keep it open
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
👋
Bump
Environment info

`transformers` version: 3.1.0

Who can help

Most appropriate seems @mfuntowicz (tokenization); blame says @thomwolf.
Information
Model I am using (Bert, XLNet ...): OpenAI GPT (also BERT)
The problem arises when using:
The tasks I am working on are:
I am adding special tokens (`BOS`, `SEP` and `EOS`) to the GPT-1 tokenizer in order to format and fine-tune a GPT model a bit differently. I am also making use of the convenient `return_special_tokens_mask` argument in `encode_plus()`, though it does not seem to mark the added custom special tokens as special in the returned mask. The same is also true when adding custom special tokens to the BERT tokenizer; I did not check beyond these two. The problem for GPT seems to be that `get_special_tokens_mask()` in `tokenization_utils.py` does not take any special tokens into account. For BERT, it only seems to take `[CLS]` and `[SEP]` into account.

To reproduce
Expected behavior
I would expect the additional special tokens to also get marked as special, i.e. that the `special_tokens_mask` in the above snippet would be `[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]`.
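For illustration, the expected mask can be computed directly by checking each input id against the set of registered special-token ids (the ids that `tokenizer.all_special_ids` would report after `add_special_tokens`); the concrete id values below are hypothetical:

```python
# Hypothetical ids: suppose <bos>, <sep> and <eos> were assigned 40478-40480
all_special_ids = {40478, 40479, 40480}

# 17 input ids with the custom special tokens at positions 0, 8 and 16
input_ids = [40478, 5, 6, 7, 8, 9, 10, 11, 40479,
             12, 13, 14, 15, 16, 17, 18, 40480]

# Mark every id that belongs to a registered special token
special_tokens_mask = [1 if tok_id in all_special_ids else 0 for tok_id in input_ids]
print(special_tokens_mask)
# → [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```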