meta-llama / llama-stack-apps

Agentic components of the Llama Stack APIs
MIT License

worker_process_entrypoint FAILED #15

Open · prashanthcheemala opened this issue 3 months ago

prashanthcheemala commented 3 months ago

Running `llama inference start` produces:

```
/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/llama_toolchain/utils.py:43: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  initialize(config_path=relative_path)
Loading config from : /root/.llama/configs/inference.yaml
Yaml config:
```

```yaml
inference_config:
  impl_config:
    impl_type: inline
    checkpoint_config:
      checkpoint:
        checkpoint_type: pytorch
        checkpoint_dir: /root/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/
        tokenizer_path: /root/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
        model_parallel_size: 1
    quantization_format: bf16
    quantization: null
    torch_seed: null
    max_seq_len: 16384
    max_batch_size: 1
```
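As a back-of-envelope sketch of what this config asks for (the layer and head counts below are assumptions taken from the published Llama-3.1-8B architecture, not from this repo; verify against the model card), a bf16 8B checkpoint needs roughly 16 GB for the weights alone, plus a KV cache that grows with `max_seq_len`:

```python
# Rough memory estimate for the config above. Assumed Llama-3.1-8B shapes:
# 32 layers, 8 KV heads, head_dim 128 (hypothetical here; check the model card).

def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB; a bf16 checkpoint uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(max_seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache: one K and one V vector per layer per cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * max_seq_len / 1e9

print(weights_gb(8e9))        # 16.0 -> ~16 GB of weights alone
print(kv_cache_gb(16384))     # ~2.1 GB more at max_seq_len: 16384
```

If the host has less headroom than that, lowering `max_seq_len` in `inference.yaml` is a cheap first experiment.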


```
Listening on :::5000
INFO:     Started server process [6765]
INFO:     Waiting for application startup.
```

```
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (pid: 6774) of fn: worker_process_entrypoint (start_method: fork)
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702]     self._pc.join(-1)
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 170, in join
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702]     raise ProcessExitedException(
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

worker_process_entrypoint FAILED

Failures:

-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-25_12:48:53
  host       : ip-119-181-1-31.ec2.internal
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 6774)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 6774
=====================================================
```
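The key detail in the report above is `exitcode: -9`. By the convention Python's multiprocessing (and torchelastic on top of it) uses, a negative exit code means the worker was killed by signal `abs(code)`, which the standard library can decode:

```python
import signal

# torchelastic reported `exitcode: -9`; a negative exit code means the
# worker was killed by signal abs(code) rather than exiting on its own.
exitcode = -9
sig = signal.Signals(-exitcode)
print(sig.name)  # SIGKILL -- the signal the kernel OOM killer sends
```

SIGKILL cannot be caught, which is why there is no traceback from the worker itself; only the parent's `ChildFailedError` survives.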
Christopher-Dreamwalker commented 3 months ago

LLaMA says:

> 🥸 Hi, any idea how to fix:
>
> ```
> Root Cause (first observed failure):
> [0]:
>   time       : 2024-07-25_12:48:53
>   host       : ip-119-181-1-31.ec2.internal
>   rank       : 0 (local_rank: 0)
>   exitcode   : -9 (pid: 6774)
>   error_file : <N/A>
>   traceback  : Signal 9 (SIGKILL) received by PID 6774
> ```

🦙 The error message you're seeing typically indicates that your process was terminated by the operating system, usually due to resource constraints such as running out of memory. Here are some steps you can take to try to resolve the issue:

  1. Check system logs: Look into system logs (e.g., /var/log/syslog or /var/log/messages) around the time of the error (2024-07-25_12:48:53) to see if there are any messages indicating why the process was killed.
  2. Monitor system resources: Use tools like top, htop, or atop to monitor your system's resource usage (CPU, memory, disk, etc.) while running your application. This can help you identify if you're running low on any resources.
  3. Increase resources: If you're running your application on an AWS EC2 instance (as indicated by the hostname ip-119-181-1-31.ec2.internal), you might need to upgrade to a larger instance type with more memory or adjust your application's configuration to use fewer resources.
  4. Configure OOM Killer: If your process is being killed by the Out-of-Memory (OOM) Killer, you might want to consider configuring it to spare your critical processes or to start killing other, less important processes first.
  5. Check your application's memory usage: Use profiling tools to analyze your application's memory usage patterns. For example, if you're running a Python application, you could use tools like mprof or line_profiler to identify memory leaks.
  6. Implement retries or checkpoints: If your application is performing a long-running task, consider implementing retries or checkpoints so that if it's terminated, it can resume from where it left off.
  7. Check for any resource-intensive tasks: Review your code for any resource-intensive tasks that might be causing the system to run low on resources.
  8. System configuration: Ensure your system's configuration is optimal for your workload. This includes adjusting ulimits, kernel parameters, or other system settings.

If none of these suggestions help, please provide more details about your application, its environment, and any relevant configurations, and I'll do my best to provide a more specific solution.
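For step 5 above, the standard-library `tracemalloc` module is a zero-install way to get a quick read on Python-level allocation peaks (shown here with a toy allocation, not the actual inference workload):

```python
import tracemalloc

# Toy demonstration: track current and peak Python-level allocation size.
# For the real server you would start tracing before loading the model.
tracemalloc.start()
blobs = [bytearray(1024) for _ in range(10_000)]   # ~10 MB of buffers
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```

Note that `tracemalloc` only sees allocations made through Python's allocator, not GPU memory or memory-mapped checkpoint pages, so it complements rather than replaces the system-level tools in steps 1 and 2.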