Closed: Sternbach-Software closed this issue 7 months ago.
BPE is only morphology-aware to the extent implied by n-gram frequency. See https://github.com/openai/tiktoken#what-is-bpe-anyway and the tiktoken._educational submodule.
What's "correct" here is determined by what matches what OpenAI's models were trained with. What's desirable here is determined by the resulting performance of models. If you had a different tokenisation algorithm that is reasonably fast and improves language model performance, that would be an interesting research result :-)
The morphology of "Elaborate" is e-labor-ate: out-work-produced by. The tokenizer at https://platform.openai.com/tokenizer says it is el-abor-ate.
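To see why the split follows frequency rather than morphology, here is a minimal sketch of BPE training (not tiktoken's actual implementation, and the toy corpus is invented for illustration): each round greedily merges the single most frequent adjacent symbol pair, with no notion of morpheme boundaries.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. Merges are driven purely by pair frequency, so they
    freely cut across morpheme boundaries."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical corpus of "labor"-family words.
words = ["elaborate", "labor", "laboratory", "collaborate"]
merges, vocab = bpe_merges(words, 3)
print(merges)  # first merge is ('o', 'r'): "or" is the most frequent pair
```

The first learned merges here ("or", "la", "lab") straddle the e-labor-ate morpheme boundaries in exactly the way the platform tokenizer's el-abor-ate split does: the algorithm optimizes compression over observed text, not linguistic segmentation.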