karpathy / nano-llama31

nanoGPT style version of Llama 3.1

API not available in llama3_1 #8

Open erlebach opened 2 months ago

erlebach commented 2 months ago

In reference.py, the following fails:

```
from llama_models.llama3_1.api import ModelArgs
```

I aliased api/ in 'llama3' to be available in 'llama3_1', but that generated an error, which relates to the transformer library not being available:

```
torchrun --nnodes 1 --nproc_per_node 1 reference.py --ckpt_dir llama-models/models/llama3_1/Meta-Llama-3.1-8B --tokenizer_path llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model

W0831 11:24:51.753000 8406420480 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "/Users/erlebach/src/2024/nano-llama31/reference.py", line 41, in <module>
    from llama_models.llama3_1.api import Transformer
ImportError: cannot import name 'Transformer' from 'llama_models.llama3_1.api' (/Users/erlebach/src/2024/nano-llama31/llama-models/models/llama3_1/api/__init__.py)
E0831 11:24:53.044000 8406420480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 72150) of binary: /Users/erlebach/src/2024/nano-llama31/.venv/bin/python
Traceback (most recent call last):
  File "/Users/erlebach/src/2024/nano-llama31/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/erlebach/src/2024/nano-llama31/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/nano-llama31/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/Users/erlebach/src/2024/nano-llama31/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/Users/erlebach/src/2024/nano-llama31/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/nano-llama31/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

reference.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-08-31_11:24:53
  host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 72150)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

My question is why api is not available in llama3_1 and why reference.py assumes it is. Thanks.

indianspeedster commented 2 months ago

This issue occurs because Meta has changed the structure of the llama-models codebase since reference.py was written.

A workaround is to check out an older commit of llama-models:

cd ~/nano-llama31/llama-models
git checkout ae2f290ffcdc7cdc621cf5f3ae10011861b50a77

Also make sure to place the previously downloaded model "Meta-Llama-3.1-8B" inside nano-llama31/llama-models/models/llama3_1/
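
To confirm the workaround took effect before launching torchrun, a small check along these lines may help. It is only a sketch: the helper names `missing_names` and `missing_paths` are hypothetical, the module and class names are the ones from the traceback above, and the checkpoint path assumes the script is run from the nano-llama31 directory.

```python
import importlib
from pathlib import Path


def missing_names(module_name, names):
    """Return the names that are not attributes of the imported module.

    If the module itself cannot be imported, every name counts as missing.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return list(names)
    return [name for name in names if not hasattr(module, name)]


def missing_paths(ckpt_dir):
    """Return the required checkpoint paths that do not exist on disk."""
    ckpt = Path(ckpt_dir)
    required = [ckpt, ckpt / "tokenizer.model"]
    return [str(path) for path in required if not path.exists()]


if __name__ == "__main__":
    # Names taken from the import lines quoted in this issue.
    names = missing_names("llama_models.llama3_1.api", ["ModelArgs", "Transformer"])
    # Assumed location, matching the --ckpt_dir argument used above.
    paths = missing_paths("llama-models/models/llama3_1/Meta-Llama-3.1-8B")
    print("missing names:", names)
    print("missing paths:", paths)
```

If both lists come back empty, the import that failed at line 41 of reference.py should resolve and the model files are where the torchrun command expects them.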