jerheff closed 9 months ago
+1
It fails with this error when used in a multiprocessing context, e.g.:
train_dataset = train_dataset.map(
    tokenization_function,
    batched=True,
    batch_size=1000,
    num_proc=args.num_proc,
    load_from_cache_file=not args.overwrite_cache,
    desc=f"Running tokenizer on train dataset with {len(train_dataset)} items",
)
Related issue: https://github.com/openai/tiktoken/pull/181
I made Encoding picklable in tiktoken 0.6. Please let me know if the implementation doesn't work well for you!
@hauntsaninja Works for me. Thanks!
I am using tiktoken in a dataset preprocessing step for a PyTorch DataLoader. The DataLoader supports multiprocessing for batch creation by spawning worker processes. This fails with the exception:
TypeError: cannot pickle 'builtins.CoreBPE' object
I am not familiar with Rust, but this thread seems to suggest that a few methods in the Rust implementation would enable pickling the tokenizer.
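For anyone stuck on a version where the tokenizer cannot be pickled, one possible workaround is to avoid pickling the object at all and send only the recipe for rebuilding it, so each worker process constructs its own copy on first use. The sketch below is not tiktoken code: `LazyPicklable`, `Resource`, and `make_resource` are hypothetical names, and `Resource` is a stand-in that holds a `threading.Lock` purely so that it fails to pickle the way `CoreBPE` does. With tiktoken, the factory would presumably be `tiktoken.get_encoding` called with an encoding name, but that is an assumption I have not tested here.

```python
import pickle
import threading


class Resource:
    """Hypothetical stand-in for an unpicklable tokenizer object."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # locks cannot be pickled


def make_resource(name):
    # Module-level factory, so pickle can reference it by name.
    return Resource(name)


class LazyPicklable:
    """Pickle the recipe (factory + args) instead of the live object."""

    def __init__(self, factory, *args):
        self._factory = factory
        self._args = args
        self._obj = None  # built lazily, never pickled

    def get(self):
        # Build the wrapped object on first access in this process.
        if self._obj is None:
            self._obj = self._factory(*self._args)
        return self._obj

    def __getstate__(self):
        # Drop the live object; ship only what is needed to rebuild it.
        return {"_factory": self._factory, "_args": self._args}

    def __setstate__(self, state):
        self._factory = state["_factory"]
        self._args = state["_args"]
        self._obj = None


if __name__ == "__main__":
    wrapper = LazyPicklable(make_resource, "demo")
    wrapper.get()  # build the object in the parent process
    clone = pickle.loads(pickle.dumps(wrapper))  # what a worker would receive
    print(clone.get().name)
```

The trade-off is that every worker pays the construction cost once, which is usually acceptable for a tokenizer but worth measuring if the factory is slow.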