jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.3k stars, 425 forks

Unable to pretrain: tokenizer raises NotImplementedError #143

Status: Closed (closed by zxti 4 months ago)

zxti commented 5 months ago

When following PRETRAIN.md and running one of the data prep scripts:

python scripts/prepare_slimpajama.py --source_path datasets/SlimPajama-627B/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split validation --percentage 1.0

The tokenizer raises the error below. It looks like something has to exist at data/llama first, perhaps a checkpoint? How do you get it?

Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "scripts/prepare_slimpajama.py", line 39, in prepare_full
    tokenizer = Tokenizer(tokenizer_path)
  File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
    raise NotImplementedError
NotImplementedError
[Process-2, Process-3, and Process-4 fail with the identical traceback.]
Time taken: 0.02 seconds
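
For anyone hitting the same traceback: it only shows that Tokenizer.__init__ in lit_gpt/tokenizer.py raises NotImplementedError for the path it was given. A minimal preflight sketch follows, assuming (not confirmed in this thread) that the loader looks for a tokenizer.model or tokenizer.json file inside the --tokenizer_path directory. The helper name find_tokenizer_file is hypothetical.

```python
from pathlib import Path

# Assumption: the tokenizer loader accepts either of these files in the directory.
EXPECTED_FILES = ("tokenizer.model", "tokenizer.json")


def find_tokenizer_file(tokenizer_dir):
    """Return the first expected tokenizer file found in tokenizer_dir, or None."""
    directory = Path(tokenizer_dir)
    for name in EXPECTED_FILES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Same directory that was passed as --tokenizer_path in the command above.
    found = find_tokenizer_file("data/llama")
    if found is None:
        print("No tokenizer file in data/llama; download one before running the prep script.")
    else:
        print(f"Found tokenizer file: {found}")
```

Running this before the data prep script gives a readable message instead of one traceback per worker process.
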
m0Nst3r873 commented 5 months ago

I've hit the same error. If you've fixed it, please let me know.

ChaosCodes commented 4 months ago

Hi, you can download the tokenizer with:

mkdir data && cd data && mkdir llama && cd llama && wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T/blob/main/tokenizer.model && cd ../..

awgr commented 3 months ago

That URL will serve you a redirect, so wget will download an HTML file and name it tokenizer.model.
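
A rough sketch of how to check for this and build a direct-download URL. It relies on the Hugging Face convention that /blob/ paths are the web viewer while /resolve/ paths serve the raw file. The helper names to_raw_url and looks_like_html are hypothetical, and the HTML check is only a heuristic on the first bytes of the file.

```python
from pathlib import Path


def to_raw_url(url):
    """Swap the first '/blob/' path segment for '/resolve/' to target the raw file."""
    return url.replace("/blob/", "/resolve/", 1)


def looks_like_html(path):
    """Heuristic: True if the file starts with an HTML doctype or <html> tag."""
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


if __name__ == "__main__":
    page_url = (
        "https://huggingface.co/TinyLlama/"
        "TinyLlama-1.1B-intermediate-step-480k-1T/blob/main/tokenizer.model"
    )
    # URL to pass to wget instead of the /blob/ one.
    print(to_raw_url(page_url))

    # Check whether an earlier download saved a web page instead of the model.
    target = Path("data/llama/tokenizer.model")
    if target.is_file() and looks_like_html(target):
        print(f"{target} looks like an HTML page; delete it and download again.")
```

If the check flags the existing file, remove it and re-run wget with the printed URL from inside data/llama.
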