mariosasko closed this 6 months ago
@mariosasko Any reason this hasn't been merged yet? It would be very helpful for parallelizing encodings across Spark DataFrames.
@mariosasko @ntopousis I'm also hitting this issue when trying to use tiktoken in Spark DataFrames. Is there any way this could be merged and deployed?
Thanks for the PR. I've been using a different patch internally that allows Encoding to be pickled, with slightly different semantics. I released it in tiktoken 0.6. Please try it out and let me know if it doesn't work well for your use case.
Thanks to https://github.com/karpathy/nanoGPT, many users use `tiktoken` with the `datasets` `map` method in multi-process mode. However, `tiktoken.Encoding` currently doesn't support pickling (`_core_bpe`, which references a Rust binding object, is the problem), which leads to serialization errors (see https://github.com/huggingface/datasets/issues/5536#issuecomment-1681820932 for the reproducer). This PR implements `tiktoken.Encoding.__reduce__` to fix this.
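
For readers unfamiliar with `__reduce__`, here is a minimal sketch of the general technique, not the code from this PR or from tiktoken 0.6. `FakeCoreBPE` and `ToyEncoding` are hypothetical stand-ins: the first holds a `threading.Lock` to imitate a native object that cannot be pickled, and the second rebuilds it from plain constructor arguments on unpickling.

```python
import pickle
import threading


class FakeCoreBPE:
    """Hypothetical stand-in for a native object that cannot be pickled."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self._lock = threading.Lock()  # lock objects are not picklable

    def encode(self, text):
        return [self.vocab[ch] for ch in text]


class ToyEncoding:
    """Hypothetical wrapper, loosely shaped like an encoding object."""

    def __init__(self, name, vocab):
        self.name = name
        self._vocab = dict(vocab)
        self._core_bpe = FakeCoreBPE(self._vocab)

    def encode(self, text):
        return self._core_bpe.encode(text)

    def __reduce__(self):
        # Tell pickle to rebuild the object by calling the constructor
        # with plain, picklable arguments instead of copying _core_bpe.
        return (ToyEncoding, (self.name, self._vocab))


if __name__ == "__main__":
    enc = ToyEncoding("toy", {"a": 1, "b": 2})
    restored = pickle.loads(pickle.dumps(enc))
    print(restored.name, restored.encode("ab"))
```

Without `__reduce__`, pickling `ToyEncoding` would fail on the lock inside `_core_bpe`; with it, the unpicklable member is recreated in the receiving process, which is what multi-process `map` needs.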