Closed schnerd closed 4 years ago
Input string: hello 👋 world 🌍
Python
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    prompt = "hello 👋 world 🌍"
    encoded = tokenizer.encode(prompt)
    print(f"Count: {len(encoded)}")  # 7
    print(f"Decoded: {tokenizer.decode(encoded)}")  # hello 👋 world 🌍
GPT-3-Encoder
    const {encode, decode} = require('./encoder.js')

    const str = 'hello 👋 world 🌍'
    const encoded = encode(str)
    console.log('Count: ', encoded.length); // 4
    console.log('Decoded: ', decode(encoded)); // hello world
Just wanted to document the issue, no immediate fix expected. Maybe someone from the community will find the urge to submit a PR if you all don't get around to it.
Thanks for the issue report! I tracked down the bug (it had to do with UTF-8 encoding) and fixed it. The fix should be included in version 1.1.0.
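For anyone curious why UTF-8 handling matters here, below is a minimal sketch of the general pitfall. It is not the library's actual code or the actual fix, only an illustration under the assumption that a byte-level tokenizer needs the UTF-8 bytes of the input. JavaScript strings are indexed by UTF-16 code units, so `charCodeAt` never yields the bytes of an emoji, while `TextEncoder` does. The variable names are made up for this example.

```javascript
// Hypothetical illustration only; not taken from encoder.js.
const str = 'hello 👋 world 🌍';

// JavaScript strings are sequences of UTF-16 code units.
// Each emoji here is a surrogate pair, i.e. two code units above 0xFF.
const codeUnits = Array.from({ length: str.length }, (_, i) => str.charCodeAt(i));

// A byte-level tokenizer works on UTF-8 bytes instead.
// TextEncoder always produces UTF-8, so each emoji becomes four bytes.
const utf8Bytes = Array.from(new TextEncoder().encode(str));

// Decoding the UTF-8 bytes restores the original string, emoji included.
const roundTripped = new TextDecoder().decode(Uint8Array.from(utf8Bytes));

console.log('UTF-16 code units:', codeUnits.length);
console.log('UTF-8 bytes:', utf8Bytes.length);
console.log('Round trip OK:', roundTripped === str);
```

If an encoder treats code units as if they were bytes, any character outside the single-byte range can be mapped incorrectly or dropped, which would be consistent with the emoji missing from the decoded output above.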