openai / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/

BPE tokenizer has a problem when training on a low-resource language like Persian #264

Open Mary-NJ opened 4 years ago

Mary-NJ commented 4 years ago

Hi everybody, I'm trying to train GPT-2 on a large amount of Persian data for some special tasks, but I've run into a problem with this tokenizer. After training on the data, the .json and .txt frequency files contain what look like unknown characters, for example "ĠبادÙĩا". The BPE tokenizer has no such problem with English text. This confuses me, because the tokenizer was trained on the Persian dataset but doesn't seem able to encode a simple Persian sentence.

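import logging
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

logging.basicConfig(level=logging.INFO)

# Collect every .txt file under ./txt/ (recursively)
# paths = [str(x) for x in Path("./eo_data/").glob("**/*.txt")]
paths = [str(x) for x in Path("./txt/").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()

# Train once on all files: train() already receives the full list,
# so looping over paths would only retrain on the same data each time
try:
    tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>", ])
except Exception as e:
    logging.warning(e)

# Save files to disk
# (newer tokenizers releases use save_model(directory, prefix) for this)
tokenizer.save(".","Maryam_V1")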

This is the relevant piece of code. I'd be very grateful for any guidance :(

amirsh87 commented 1 year ago

I have the same issue.
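
For anyone hitting this: strings like the one quoted above are probably not corruption. A byte-level BPE vocabulary stores tokens as UTF-8 bytes, with each byte shown as a printable stand-in character, so non-Latin text looks garbled in the vocab and merges files. Below is a minimal sketch of that mapping, modelled on the `bytes_to_unicode` function in this repo's `src/encoder.py`. The helper names `to_vocab_form` and `from_vocab_form` are made up for illustration, and it is an assumption that the tokenizer used in the report follows the same table.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character,
    following the scheme used by GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point
    bs = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (space, control bytes, ...) are shifted to 256 + n
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Hypothetical helper: how a piece of text would appear in the vocab file."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Hypothetical helper: turn a vocab-file string back into readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in token)
    # A single token may end in the middle of a multi-byte character
    return raw.decode("utf-8", errors="replace")


sample = " بادها"  # a space followed by Persian letters
encoded = to_vocab_form(sample)
print(encoded)                   # garbled-looking, starts with 'Ġ' for the space
print(from_vocab_form(encoded))  # readable Persian again
```

If this is what is happening, the saved files are fine, and a better check is a round trip through the trained tokenizer itself, for example `tokenizer.decode(tokenizer.encode(text).ids)`, which should give back the original Persian sentence.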