meta-llama / llama

Inference code for Llama models

Unable to run example program - example_text_completion.py #433

Open skathpalia opened 1 year ago

skathpalia commented 1 year ago

Unable to run the following command:

torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4

I am running it on a MacBook Pro with the following configuration.

(Screenshot attached: MacBook Pro configuration, 2023-07-19)
ghost commented 1 year ago

Me too.. :( (screenshot attached)

krychu commented 1 year ago

You will need to run it on CPU: https://github.com/krychu/llama. Let me know if that works with 16GB memory; it might be a bit tight.
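For a rough sense of why 16GB is tight, a back-of-the-envelope sketch, assuming the 7B weights are held in fp16 (an illustrative estimate, not a measurement from this thread):

```python
# Rough lower bound on the memory needed just to hold Llama-2-7B weights in fp16.
# Activations, the KV cache, and interpreter overhead come on top of this,
# which is why 16 GB of RAM leaves little headroom.
n_params = 7_000_000_000        # ~7 billion parameters
bytes_per_param = 2             # fp16
weights_gib = n_params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")   # roughly 13 GiB
```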

skathpalia commented 1 year ago

Thank you @krychu !

skathpalia commented 1 year ago

Now I get:

RuntimeError: MPS backend out of memory (MPS allocated: 3.34 GB, other allocations: 9.99 MB, max allowed: 3.40 GB). Tried to allocate 86.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

I inserted PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 in the generation.py file, but I still get the same error.

Thank you for your help in advance.
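A note on that fix: PYTORCH_MPS_HIGH_WATERMARK_RATIO is read by PyTorch from the environment, so writing that line into generation.py as ordinary Python code has no effect. A minimal sketch of setting it before the MPS backend allocates anything; the placement at the top of the entry script is an assumption, not the repo's code:

```python
# Option 1 (shell): set the variable for the whole run, e.g.
#   PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 torchrun --nproc_per_node 1 example_text_completion.py ...
#
# Option 2 (Python): set it at the very top of the entry script, *before*
# torch performs any MPS allocation. A value of 0.0 removes the upper bound
# on MPS allocations and can destabilize the system, as the error message warns.
import os
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import torch  # imported only after the environment variable is set
print(torch.backends.mps.is_available())  # sanity check that MPS is usable
```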

skathpalia commented 1 year ago

Here are the results after fixing the above error:

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

NOTE: Redirects are currently not supported in Windows or MacOs.

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "/Volumes/Users/gitRepositories/llama/example_text_completion.py", line 56, in <module>
    fire.Fire(main)
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Volumes/Users/gitRepositories/llama/example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "/Volumes/Users/gitRepositories/llama/llama/generation.py", line 92, in build
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
AssertionError: no checkpoint files found in --ckpt_dir
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63097) of binary: /Users/shivkathpalia/venv/bin/python
Traceback (most recent call last):
  File "/Users/shivkathpalia/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/Users/shivkathpalia/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-22_10:06:34
  host      : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63097)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
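Worth noting about the failure above: the assertion message shows that ckpt_dir arrived as the literal string "--ckpt_dir", i.e. the flag's value was never bound, most likely because the stray line-continuation backslashes confused argument parsing when the command was pasted onto one line. A small pre-flight check, mirroring the *.pth glob that generation.py's assertion is based on (file names assumed to match the standard llama-2-7b download):

```python
# Pre-flight check: confirm the checkpoint directory contains the *.pth file(s)
# that Llama.build() globs for, plus params.json, before launching torchrun.
from pathlib import Path

ckpt_dir = Path("llama-2-7b")                # adjust to where the weights were downloaded
pth_files = sorted(ckpt_dir.glob("*.pth"))   # e.g. consolidated.00.pth for the 7B model
params_json = ckpt_dir / "params.json"

print("checkpoint files:", [p.name for p in pth_files])
print("params.json present:", params_json.exists())
assert pth_files, f"no checkpoint files found in {ckpt_dir}"
```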
pzim-devdata commented 1 year ago

I solved it with a CPU installation by using https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama. Complete process to install (a quick sanity check follows the steps):

  1. download the original version of Llama from https://github.com/facebookresearch/llama and extract it to a llama-main folder
  2. download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the llama-main folder
  3. run the download.sh script in a terminal, passing the URL provided when prompted, to start the download
  4. go to the llama-main folder
  5. create a Python3 env: python3 -m venv env and activate it: source env/bin/activate
  6. install the CPU version of PyTorch: python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # for the CPU version
  7. install llama's dependencies: python3 -m pip install -e .
  8. run, if you have downloaded llama-2-7b:
    torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 1 # (instead of 4)
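As the sanity check mentioned above (a hypothetical addition, not part of the original steps), one way to confirm the CPU-only PyTorch from step 6 is active before running torchrun:

```python
# Verify the environment produced by steps 5-7: CUDA should be unavailable
# and a simple tensor op should run on the CPU without error.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())   # expected: False on the CPU build

x = torch.randn(2, 3)
print("CPU matmul OK:", (x @ x.T).shape)               # expected: torch.Size([2, 2])
```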