Hi everybody, I'm trying to train GPT-2 on a large amount of Persian data for some specific tasks,
but I've run into a problem with the tokenizer.
After training on the data, the .json and .txt frequency files contain some unknown characters :(((((
For example:
"ĠبادÙĩا"
Something like this...
It's worth mentioning that the BPE tokenizer has no problem with English text, which confuses me: it was trained on the Persian dataset, yet it can't encode a simple Persian sentence.
import logging
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

logging.basicConfig(level=logging.INFO)

# collect every .txt file under ./txt/
paths = [str(x) for x in Path("./txt/").glob("**/*.txt")]

tokenizer = ByteLevelBPETokenizer()

# train() accepts the whole list of files, so it only needs to be called once
try:
    tokenizer.train(
        files=paths,
        vocab_size=52_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
except Exception as e:
    logging.warning(e)

# Save vocab.json and merges.txt to disk
# (save_model in recent tokenizers versions; older ones used tokenizer.save(".", "Maryam_V1"))
tokenizer.save_model(".", "Maryam_V1")
That's the relevant piece of code... any guidance would make me happy :(((
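In case it helps, this is roughly how I'm checking the tokenizer after training; the English and Persian test sentences here are just placeholders, not lines from my dataset:

# quick sanity check: English encodes fine, so I expected the same for Persian
enc_en = tokenizer.encode("hello world")
print(enc_en.tokens)                  # readable English tokens

enc_fa = tokenizer.encode("سلام دنیا")  # placeholder Persian sentence ("hello world")
print(enc_fa.tokens)                  # these show up in the same odd "Ġ..." style as in vocab.json
print(tokenizer.decode(enc_fa.ids))   # checking whether the ids round-trip back to the original text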