Open kgopfa opened 11 months ago
Problem Description

After completing the setup for CodeLlama from the README.md, when I attempt to run any of the examples with the specified commands:
```
torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 128 --max_batch_size 4
```

OR

```
torchrun --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 192 --max_batch_size 4
```

OR

```
torchrun --nproc_per_node 1 example_instructions.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4
```
I get the output with the error below:
Output

```
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 31383) of binary: /home/abc/miniconda3/envs/llama_env/bin/python
Traceback (most recent call last):
  File "/home/abc/miniconda3/envs/llama_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example_completion.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-12-10_13:12:17
  host       : ABC-PC.
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 31383)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 31383
======================================================
```
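Exit code -9 means the child process received SIGKILL rather than crashing on its own; on Linux that is most often the kernel's OOM killer terminating a process that exhausted memory. A quick way to check, assuming a standard Ubuntu-on-WSL-2 setup (the exact log wording varies by kernel):

```sh
# Look for OOM-killer activity in the kernel log right after the failed run
sudo dmesg | grep -iE "out of memory|killed process"

# Check how much RAM and swap is actually available inside WSL
free -h
```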
Runtime Environment

Model: CodeLlama-7b, CodeLlama-7b-Instruct, CodeLlama-7b-Python
Additional context

I am trying to run the models on Ubuntu through WSL 2. I tried setting the batch size to 6 (`--max_batch_size 6`), as was mentioned in llama #706, but this did not help.
I ran into the same issue. Checking `htop`, I found that I was running out of RAM. Try editing `.wslconfig` to give WSL 2 more RAM.
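For reference, a minimal `.wslconfig` sketch along those lines. It lives at `%UserProfile%\.wslconfig` on the Windows side, and the sizes below are assumptions to adapt to your host (the 7B checkpoint alone is roughly 13 GB, so the default WSL 2 memory cap of about half of host RAM can be too low). After editing, restart WSL with `wsl --shutdown`:

```ini
; %UserProfile%\.wslconfig -- Windows-side config read when WSL 2 starts
[wsl2]
memory=32GB   ; assumed value: raise the RAM cap above the default
swap=16GB     ; assumed value: swap headroom while the checkpoint loads
```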