openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Pickling tokenizer fails due to builtins.CoreBPE #231

Closed. jerheff closed this issue 5 months ago

jerheff commented 6 months ago

I am using tiktoken in a dataset preprocessing step for a PyTorch DataLoader. The DataLoader supports multiprocessing for batch creation, which spawns worker processes, and this fails with the exception:

TypeError: cannot pickle 'builtins.CoreBPE' object

I am not familiar with Rust, but this thread seems to suggest that adding a few methods to the Rust implementation would enable pickling the tokenizer.
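
For reference, a minimal sketch of the failure outside of any DataLoader (assuming tiktoken < 0.6 and the cl100k_base encoding; spawn-based worker start-up hits the same code path when it pickles the dataset):

    import pickle

    import tiktoken

    # Obtain an encoding; cl100k_base is used here only as an example.
    enc = tiktoken.get_encoding("cl100k_base")

    # On tiktoken < 0.6 this raises:
    #   TypeError: cannot pickle 'builtins.CoreBPE' object
    # because Encoding holds a reference to the Rust-backed CoreBPE object.
    pickle.dumps(enc)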

sk-g commented 6 months ago

+1

Fails with this error when used in a multiprocessing context, e.g.:


    train_dataset = train_dataset.map(
        tokenization_function,
        batched=True,
        batch_size=1000,
        num_proc=args.num_proc,
        load_from_cache_file=not args.overwrite_cache,
        desc=f"Running tokenizer on train dataset with {len(train_dataset)} items",
    )
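
One workaround that avoided the error before pickling support landed is to construct the encoding lazily inside the mapped function, so no CoreBPE-backed object has to cross a process boundary. A sketch under that assumption (the function body and the "text" column name are illustrative, not from the original report):

    import tiktoken

    def tokenization_function(batch):
        # Build the encoding inside the worker process instead of capturing a
        # pre-built Encoding in the closure; tiktoken caches constructed
        # encodings per process, so repeated calls are cheap.
        enc = tiktoken.get_encoding("cl100k_base")
        return {"input_ids": enc.encode_batch(batch["text"])}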
sk-g commented 6 months ago

Related pull request: https://github.com/openai/tiktoken/pull/181

hauntsaninja commented 5 months ago

I've allowed Encoding to be pickled in tiktoken 0.6. Please let me know if the implementation doesn't work well for you!
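
A quick round-trip check of the new behaviour, assuming tiktoken >= 0.6 and the cl100k_base encoding:

    import pickle

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # With tiktoken >= 0.6 this no longer raises TypeError.
    restored = pickle.loads(pickle.dumps(enc))
    assert restored.encode("hello world") == enc.encode("hello world")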

jerheff commented 5 months ago

@hauntsaninja Works for me. Thanks!