Open · romilbhardwaj opened this issue 1 month ago
Looks like our fp8 code has rotted a bit. It's a bad import. I will fix this up quick.
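In the meantime, you should be able to sidestep the fp8 path entirely by picking bf16 when you configure the distribution. A rough sketch (the exact configure prompts may differ by version, so treat this as an assumption rather than the precise flow):

    llama stack configure 8b-instruct   # choose bf16 / no quantization when asked
    llama stack run 8b-instruct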
Thanks @ashwinb. I tried with bf16 and got the following:
(base) gcpuser@l4-2ea4-head-7te0mjrn-compute:~$ llama stack run 8b-instruct
Resolved 8 providers in topological order
Api.models: routing_table
Api.inference: router
Api.shields: routing_table
Api.safety: router
Api.memory_banks: routing_table
Api.memory: router
Api.agents: meta-reference
Api.telemetry: meta-reference
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (pid: 22415) of fn: worker_process_entrypoint (start_method: fork)
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] self._pc.join(-1)
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 170, in join
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] raise ProcessExitedException(
E1003 20:29:45.293000 140270447350976 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 175, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/llamastack-8b-instruct/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
worker_process_entrypoint FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-03_20:29:45
host : l4-2ea4-head-7te0mjrn-compute.us-east4-a.c.skypilot-375900.internal
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 22415)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 22415
============================================================
Is there some way to debug or view logs to find the reason for the SIGKILL? FWIW, I'm trying to run Llama3.1-8B-Instruct on 1x L4 GPU.
@romilbhardwaj A SIGKILL is typically due to an OOM. Could you perhaps try with Llama3.2-1B-Instruct?
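To confirm it was the kernel's OOM killer, a couple of standard Linux checks on the host (nothing llama-stack specific) should show it:

    sudo dmesg -T | grep -iE "killed process|out of memory"   # OOM-killer entries
    free -h                                                   # host RAM headroom
    nvidia-smi                                                # GPU memory

Rough sizing, counting weights only (ignoring activations and KV cache): 8B params × 2 bytes (bf16) is about 16 GB, versus about 2 GB for the 1B model. The L4's 24 GB of VRAM can hold the 8B weights, but the checkpoint is typically read into host memory first, so a VM with 16 GB or less of system RAM can trip the OOM killer during loading, which would match the exitcode -9 above.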
I'm following the getting started guide for llama-stack. When I run llama stack run 8b-instruct, it fails with a ModuleNotFoundError. Full logs, including the stack setup and configuration, are here: https://gist.github.com/romilbhardwaj/f21c3b1908b62ec5a906b321739d30cb
Versions: here's the full pip freeze: https://gist.github.com/romilbhardwaj/b05e950eeb03d5647d738382ba92f2a1
I tried with previous versions of llama_models, but that didn't work either. Am I missing something?
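In case it helps reproduce: a quick way to surface the failing import without going through the server is to import the provider module directly and list what's installed (the module path below is just the one from the traceback in this thread; substitute whichever module the ModuleNotFoundError in the gist actually names):

    python -c "import llama_stack.providers.impls.meta_reference.inference.parallel_utils"
    pip list | grep -i llama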