karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.19k stars · 866 forks

decode() method in GPT4Tokenizer does not handle special tokens #64

Open Vakarva opened 7 months ago

Vakarva commented 7 months ago

It appears that the decode() method in the GPT4Tokenizer class does not handle special tokens: their ids are never added to self.vocab, so the lookup fails when they appear in ids. I submitted a pull request (#63) with some updated code, but also wanted to post the issue here. Here is the original code for reference:

def decode(self, ids):
    # we have to un-permute the bytes before we decode
    text_bytes = b"".join(self.vocab[idx] for idx in ids)
    text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
    text = text_bytes.decode("utf-8", errors="replace")
    return text
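A minimal sketch of one possible fix, following the pattern RegexTokenizer.decode uses in the same repo (check each id against the vocab, and fall back to inverse_special_tokens for special ids). The ToyGPT4Tokenizer class below is a hypothetical stand-in, not the real class; it only carries the attributes the decode logic touches. Note the un-shuffle is applied per regular token, while special-token strings are emitted as-is, since they were never byte-shuffled:

```python
class ToyGPT4Tokenizer:
    """Hypothetical stand-in with just the attributes decode() needs."""

    def __init__(self):
        # toy byte shuffle for illustration: swap the bytes for 'h' and 'i'
        shuffle = {b: b for b in range(256)}
        shuffle[ord("h")], shuffle[ord("i")] = ord("i"), ord("h")
        self.inverse_byte_shuffle = {v: k for k, v in shuffle.items()}
        # vocab maps token id -> bytes in the shuffled byte space
        self.vocab = {
            0: bytes([shuffle[ord("h")]]),
            1: bytes([shuffle[ord("i")]]),
        }
        # special tokens live outside vocab; id and name are illustrative
        self.inverse_special_tokens = {100257: "<|endoftext|>"}

    def decode(self, ids):
        part_bytes = []
        for idx in ids:
            if idx in self.vocab:
                # regular token: un-permute its bytes before decoding
                part_bytes.append(
                    bytes(self.inverse_byte_shuffle[b] for b in self.vocab[idx])
                )
            elif idx in self.inverse_special_tokens:
                # special token: its string form was never byte-shuffled
                part_bytes.append(self.inverse_special_tokens[idx].encode("utf-8"))
            else:
                raise ValueError(f"invalid token id: {idx}")
        return b"".join(part_bytes).decode("utf-8", errors="replace")


tok = ToyGPT4Tokenizer()
print(tok.decode([0, 1, 100257]))  # → hi<|endoftext|>
```

The key design point is to branch per id rather than joining all vocab bytes first, so special-token bytes are never run through inverse_byte_shuffle.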