simonw / datasette-openai

SQL functions for calling OpenAI APIs
https://datasette.io/plugins/datasette-openai
Apache License 2.0

Helper functions for tokenizing text #7

Closed. simonw closed this issue 1 year ago

simonw commented 1 year ago

I need this for:

May as well expose these as functions too.

simonw commented 1 year ago

I'm going to use the GPT-2 tokenizer code from here: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53

re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

That actually uses the third-party regex module rather than the standard library re, because re doesn't support the \p{L} and \p{N} Unicode property sequences, so I'll have to add regex as a dependency.
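For illustration, here is a rough sketch of what that pattern does on its own. It only covers the pre-tokenization split, not the byte-pair encoding step that turns each piece into token IDs, so the piece count is not a token count. The names `split_pieces`, `count_pieces` and the SQL function `demo_count_pieces` are hypothetical, not the plugin's actual API. The fallback pattern for when regex isn't installed is an assumption too: a stdlib approximation that can differ on some non-ASCII input.

```python
import sqlite3

try:
    import regex  # third-party module; understands \p{L} and \p{N}

    PAT = regex.compile(
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    )
except ImportError:
    import re

    # Assumption: rough stand-in using stdlib classes, not an exact equivalent.
    PAT = re.compile(
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""
    )


def split_pieces(text):
    """Split text into the chunks the GPT-2 encoder would then BPE-encode."""
    return PAT.findall(text)


def count_pieces(text):
    """Count the chunks produced by the split (not the final token count)."""
    return len(split_pieces(text))


# Hypothetical SQL function name, registered on a plain SQLite connection.
conn = sqlite3.connect(":memory:")
conn.create_function("demo_count_pieces", 1, count_pieces)

print(split_pieces("Hello world, it's 42!"))
# ['Hello', ' world', ',', ' it', "'s", ' 42', '!']
print(conn.execute("select demo_count_pieces(?)", ["Hello world, it's 42!"]).fetchone()[0])
```

Leading spaces stay attached to the word that follows them, so joining the pieces back together reproduces the original text.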