openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Make `Encoding` serializable #181

Closed: mariosasko closed 6 months ago

mariosasko commented 1 year ago

Thanks to https://github.com/karpathy/nanoGPT, many users use tiktoken with the datasets map method in multi-process mode. However, tiktoken.Encoding currently doesn't support pickling (the problem is _core_bpe, which references a Rust binding object), and this leads to serialization errors (see https://github.com/huggingface/datasets/issues/5536#issuecomment-1681820932 for the reproducer). This PR implements tiktoken.Encoding.__reduce__ to fix this.
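For readers unfamiliar with the technique, here is a minimal sketch of how `__reduce__` can make a wrapper around an unpicklable object serializable. It does not use tiktoken and is not the code from this PR: `FakeCoreBPE` and `ToyEncoding` are hypothetical stand-ins, and a `threading.Lock` is used only to simulate a member that pickle cannot handle, as the Rust binding object does.

```python
import pickle
import threading


class FakeCoreBPE:
    """Hypothetical stand-in for a native binding object (not tiktoken's)."""

    def __init__(self, mergeable_ranks):
        self._ranks = dict(mergeable_ranks)
        # A lock cannot be pickled, so instances of this class cannot either.
        self._lock = threading.Lock()

    def encode(self, text):
        # Toy byte-level lookup, not real BPE merging.
        return [self._ranks[bytes([b])] for b in text.encode("utf-8")]


class ToyEncoding:
    """Hypothetical wrapper, loosely mirroring an Encoding-like object."""

    def __init__(self, name, mergeable_ranks):
        self.name = name
        self._mergeable_ranks = dict(mergeable_ranks)
        self._core_bpe = FakeCoreBPE(self._mergeable_ranks)

    def encode(self, text):
        return self._core_bpe.encode(text)

    def __reduce__(self):
        # Pickle only the constructor arguments; the unpicklable
        # _core_bpe member is rebuilt when the object is loaded.
        return (ToyEncoding, (self.name, self._mergeable_ranks))


ranks = {bytes([i]): i for i in range(256)}
enc = ToyEncoding("toy", ranks)

# Round-trip through pickle, as a multi-process map would do.
restored = pickle.loads(pickle.dumps(enc))
```

The design choice is to serialize the plain-Python inputs rather than the native object, so each worker process reconstructs its own binding on load. The cost is that construction runs again in every process that unpickles the object.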

ntopousis commented 6 months ago

@mariosasko Is there any reason this hasn't been merged yet? It would be very helpful for parallelizing encoding across Spark DataFrames.

S-King commented 6 months ago

@mariosasko @ntopousis I'm also hitting this issue when trying to use tiktoken in Spark DataFrames. Is there any way this could be merged and deployed?

hauntsaninja commented 6 months ago

Thanks for the PR. I've been using a different patch internally that allows Encoding to be pickled, with slightly different semantics. I released it in tiktoken 0.6. Please try it out and let me know if it doesn't work well for your use case.