karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.2k stars, 866 forks

how to deal with special tokens for multiple files #44

Open IamExperimenting opened 8 months ago

IamExperimenting commented 8 months ago

Hi,

I have a question about Byte Pair Encoding, specifically about special tokens. My domain dataset consists of 1780 files. Do I need to:

  1. add <|startoftext|> at the beginning and <|endoftext|> at the end of the text in each file?
  2. or combine all 1780 files into one, with <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
  3. Or is minbpe capable of handling this on its own?
  4. Is there a specific format I should prepare my data in before passing it to minbpe, such as a dataframe with one text file per row?
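
A minimal sketch of option 2, to make the file handling concrete. It assumes the files are UTF-8 `.txt` files in a single directory; the helper name `build_corpus` and its parameters are hypothetical and are not part of minbpe.

```python
from pathlib import Path

# Delimiter string named in the question; treated here as plain text.
END_OF_TEXT = "<|endoftext|>"


def build_corpus(data_dir, pattern="*.txt", delimiter=END_OF_TEXT):
    """Concatenate every matching file into one string.

    The delimiter is appended after each document, so document
    boundaries remain visible in the combined text.
    """
    # Sort the paths so the order of documents is reproducible.
    paths = sorted(Path(data_dir).glob(pattern))
    parts = []
    for path in paths:
        text = path.read_text(encoding="utf-8")
        parts.append(text + delimiter)
    return "".join(parts)
```

A single string is built because the examples in the minbpe README pass one string to `train`. Whether the delimiter should stay in the training text or be registered separately as a special token is the open part of the question, so this sketch covers only the file handling.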

Can you please help me understand this, @karpathy?