Closed chinoll closed 3 months ago
@chinoll I confirm this issue.
It is linked to an inconsistency in how the trainer is automatically called from your code.
A quick fix for your use case is this:
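from tokenizers import Tokenizer, trainers

# Load the pretrained tokenizer and keep a copy of its original state.
tokenizer = Tokenizer.from_pretrained("bert-base-chinese")
tokenizer.save("tok.json")

# Special tokens that must be passed back to the trainer explicitly.
special_tokens = [
    "[PAD]",
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[MASK]",
]
trainer = trainers.WordPieceTrainer(
    special_tokens=special_tokens,
)
# Retrain on the new corpus with the explicit trainer.
tokenizer.train(["data/big.txt"], trainer=trainer)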
Basically, just add the special_tokens back to the trainer.
Please note that since you are retraining, the IDs of those tokens will probably change (they are IDs 0, 100, and 101-103 in the original bert-base-chinese).
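To see which IDs the special tokens end up with after retraining, you can look them up with `token_to_id`. This is only a minimal sketch: to avoid downloading anything, it trains a fresh WordPiece tokenizer on a tiny made-up corpus instead of retraining bert-base-chinese on data/big.txt, so the `corpus` list and the `vocab_size` value are assumptions for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Hypothetical toy corpus standing in for data/big.txt.
corpus = ["hello world", "hello tokenizers", "retraining changes token ids"]

# Fresh WordPiece tokenizer (assumption: not the bert-base-chinese one).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=100,  # arbitrary small value for the toy corpus
    special_tokens=special_tokens,
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Map each special token to the ID it received in the retrained vocabulary.
new_ids = {tok: tokenizer.token_to_id(tok) for tok in special_tokens}
print(new_ids)
```

If a downstream model expects the original IDs, comparing this mapping against the one from the original tokenizer is a quick way to spot mismatches before using the retrained tokenizer.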
Thanks, this solved my problem
I will reopen this issue if you don't mind, since the easy fix works but is not the end of it.
IMHO, the code you submitted should work out of the box. At the very least no panic should ever occur, though a regular exception might be acceptable.
This issue will help when work starts on the tokenizer/trainer information flow that we need to fix, and/or on tokenizer fine-tuning.
A lot of issues seem to be popping up in this area, and really fixing them (not with a modified script that requires expert knowledge) seems like a good idea.
Glad it solved your problem though.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.