Closed Mypathissional closed 11 months ago
There is no semantic difference. r50k is stored in a format that loads a little faster and has less redundancy. Additionally, it's a nice proof that tiktoken can load GPT-2's encoding, which is as close to a standard as this field gets :-)
Hello,
I've been comparing the merge lists of "gpt2" and "r50k_base". Interestingly, they are identical. Furthermore, other configuration details such as regex patterns and special characters are also the same. Given these similarities, I'm curious about the rationale behind having two distinct encoding methods.
Here's the code I used for the comparison:
I'd appreciate any insights or explanations on this matter. Thank you!