latitudegames / GPT-3-Encoder

Javascript BPE Encoder Decoder for GPT-2 / GPT-3
MIT License
716 stars 196 forks

Input with emojis tokenizes differently than python impl #1

Closed schnerd closed 4 years ago

schnerd commented 4 years ago

Input string: hello 👋 world 🌍

Python

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "hello 👋 world 🌍"
encoded = tokenizer.encode(prompt)

print(f"Count: {len(encoded)}") # 7
print(f"Decoded: {tokenizer.decode(encoded)}") # hello 👋 world 🌍

GPT-3-Encoder

const {encode, decode} = require('./encoder.js')

const str = 'hello 👋 world 🌍'
const encoded = encode(str)

console.log('Count: ', encoded.length); // 4
console.log('Decoded: ', decode(encoded)); // hello  world

Just wanted to document the issue; no immediate fix expected. Maybe someone from the community will feel the urge to submit a PR if you all don't get around to it.

nickwalton commented 4 years ago

Thanks for the issue report! I tracked down the bug (it had to do with UTF-8 encoding) and fixed it. It should be fixed as of version 1.1.0.
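
The thread does not show the actual patch, so the following is only a minimal sketch of why UTF-8 handling matters for this input, not the change that went into 1.1.0. GPT-2 style BPE works on UTF-8 bytes, while JavaScript strings are indexed by UTF-16 code units, so each emoji is 2 code units but 4 bytes. The sketch assumes Node.js and uses the `TextEncoder` and `TextDecoder` classes from the built-in `util` module; the variable names are illustrative and are not taken from encoder.js.

```javascript
const { TextEncoder, TextDecoder } = require('util')

// Same text as the issue's input, with the two emoji written as code point escapes
const str = 'hello \u{1F44B} world \u{1F30D}'

// What str.length and charCodeAt see: UTF-16 code units (2 per emoji)
const codeUnits = str.length

// What a byte-level BPE vocabulary is defined over: UTF-8 bytes (4 per emoji)
const utf8Bytes = Array.from(new TextEncoder().encode(str))

// Decoding the same bytes as UTF-8 restores the original string, emoji included
const roundTrip = new TextDecoder('utf-8').decode(Uint8Array.from(utf8Bytes))

console.log('UTF-16 code units:', codeUnits)     // 17
console.log('UTF-8 bytes:', utf8Bytes.length)    // 21
console.log('Round trip ok:', roundTrip === str) // true
```

If an encoder maps characters to vocabulary entries without first converting to UTF-8 bytes, multi-byte characters such as these emoji can fall outside the byte table, which would be consistent with the lower token count and the missing emoji in the decoded output reported above.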