Closed: simonw closed this issue 1 year ago
I'm going to use the GPT-2 tokenizer code from here: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
That pattern requires the third-party regex module rather than the stdlib re module, because re doesn't support \p{...} Unicode property classes, so I'll have to add regex as a dependency.
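A minimal sketch of how that pattern splits text, assuming the third-party regex package is installed (the example string is mine, not from the GPT-2 repo):

```python
import regex  # third-party; stdlib re rejects the \p{...} classes below

# GPT-2 pre-tokenization pattern from openai/gpt-2 src/encoder.py
pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Splits into word pieces, contractions, numbers, and punctuation,
# each (except the first) keeping its leading space.
print(pat.findall("Hello world, it's 2024!"))
# → ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Note the contraction alternatives ('s, 't, 're, …) come first in the alternation, so "it's" splits into " it" and "'s" rather than swallowing the apostrophe into the punctuation class.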
I need this for: #4
May as well expose these as functions too.