WUHU-G closed 11 months ago
Frustrating. I spent a whole day on this and finally wrote a pat_str that supports Chinese. Together with the bit encoding, Chinese was added successfully, but the text still does not get segmented into words properly:
"'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}\p{Han}]+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
This is expected behaviour from BPE; some merges are higher priority than others. See also https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py as a resource to help with questions about BPE. Also note that pasting text is much more effective than pasting screenshots.
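To illustrate the point about merge priority, here is a minimal sketch of a BPE merge loop, loosely modelled on the idea in `_educational.py` rather than copied from it. The function name `bpe_encode_sketch` and the `toy_ranks` table are hypothetical and made up for this example; they are not a real tiktoken vocabulary. A lower rank means a higher priority merge.

```python
def bpe_encode_sketch(mergeable_ranks: dict[bytes, int], piece: bytes) -> list[bytes]:
    """Greedy BPE over one regex-split piece: always apply the lowest-rank merge."""
    # Start from single bytes.
    parts = [bytes([b]) for b in piece]
    while True:
        best_idx = None
        best_rank = None
        # Find the adjacent pair whose merged form has the lowest rank.
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_rank is None:
            break  # No adjacent pair is in the merge table.
        # Merge the winning pair and try again.
        parts = (
            parts[:best_idx]
            + [parts[best_idx] + parts[best_idx + 1]]
            + parts[best_idx + 2:]
        )
    return parts


# Hypothetical toy merge table: "bc" outranks "ab".
toy_ranks = {b"bc": 0, b"ab": 1}

print(bpe_encode_sketch(toy_ranks, b"abcd"))  # [b'a', b'bc', b'd']
```

Even though `ab` is in the toy table, it is never produced here, because the higher priority `bc` merge consumes the `b` first. The same effect applies to the UTF-8 bytes of Chinese text: the resulting tokens follow the merge ranks, not word boundaries.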