skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.53k stars, 465 forks

Random errors on GCP starting the node #3817

Open egafni opened 1 month ago

egafni commented 1 month ago

I am randomly getting the following error (in roughly 1 of 10 jobs), and I do not know how to debug it. Resubmitting the exact same command generally works.

The command is: sky launch -n job122 --down -r -y --memory=32 --gpus L4:1 devops/skypilot/job.yaml

...
log_path = os.path.expanduser(os.path.join('"'"'~/sky_logs/sky-2024-08-08-06-30-57-542722/tasks'"'"', f'"'"'{rank}-{node_name}.log'"'"'))
sky_env_vars_dict['"'"'SKYPILOT_NODE_RANK'"'"'] = rank
# Backward compatibility: Environment starting with `SKY_` is
# deprecated. Remove it in v0.9.0.
sky_env_vars_dict['"'"'SKY_NODE_RANK'"'"'] = rank

sky_env_vars_dict['"'"'SKYPILOT_INTERNAL_JOB_ID'"'"'] = 1
# Backward compatibility: Environment starting with `SKY_` is
# deprecated. Remove it in v0.9.0.
sky_env_vars_dict['"'"'SKY_INTERNAL_JOB_ID'"'"'] = 1

futures.append(run_bash_command_with_log \
        .options(name=name_str, num_cpus=0.5, resources={"L4": 1}, num_gpus=1, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)) \
        .remote(
            script,
            log_path,
            env_vars=sky_env_vars_dict,
            stream_logs=True,
            with_ray=True,
        ))

returncodes = get_or_fail(futures, pg)
if sum(returncodes) != 0:
    job_lib.set_status(1, job_lib.JobStatus.FAILED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
    reason = '"'"''"'"'
    # 139 is the return code of SIGSEGV, i.e. Segmentation Fault.
    if any(r == 139 for r in returncodes):
        reason = '"'"'(likely due to Segmentation Fault)'"'"'
    print('"'"'ERROR: Job 1 failed with '"'"'
          '"'"'return code list:'"'"',
          returncodes,
          reason,
          flush=True)
    # Need this to set the job status in ray job to be FAILED.
    sys.exit(1)
else:
    job_lib.set_status(1, job_lib.JobStatus.SUCCEEDED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)

' > ~/.sky/sky_app/sky_job_1; } && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_owner_kwargs = {} if getattr(constants, "SKYLET_LIB_VERSION", 0) >= 1 else {"job_owner": getpass.getuser()};job_lib.scheduler.queue(1,'"'"'RAY_DASHBOARD_PORT=$($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) job submit --address=http://127.0.0.1:$RAY_DASHBOARD_PORT --submission-id 1-$(whoami) --no-wait "$([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_1 > ~/sky_logs/sky-2024-08-08-06-30-57-542722/run.log 2> /dev/null"'"'"')' failed with return code 255. Failed to submit job 1.

Version & Commit info:

romilbhardwaj commented 3 weeks ago

Hi @egafni - are you running on mac? To help us debug, next time this occurs, can you try either

a) reducing the length of your run section, or
b) adding a bunch of random commands (e.g., echo <really long string>) to your run section to make it larger than 120 KB

and report if it solves the problem?

This may be related to our optimization, which inlines the run section in a single SSH command: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L141-L152
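
For option (b), a small script can generate the filler instead of typing it by hand. This is only a sketch: the helper name `make_padding` is hypothetical, and it assumes "120 KB" means 120 * 1024 bytes. The printed lines would need to be pasted (indented) under the run section of the task YAML.

```python
import sys


def make_padding(target_bytes: int = 120 * 1024, payload_len: int = 200) -> str:
    """Return newline-separated `echo` commands totalling more than target_bytes.

    Assumption: "120 KB" is read as 120 * 1024 bytes.
    """
    line = "echo " + "x" * payload_len
    # Each line costs len(line) + 1 bytes (the joining newline); add two extra
    # lines so the total is strictly above the target.
    count = target_bytes // (len(line) + 1) + 2
    return "\n".join([line] * count)


if __name__ == "__main__":
    padding = make_padding()
    # Report the size on stderr so stdout contains only the filler commands.
    print(f"generated {len(padding.encode())} bytes of filler", file=sys.stderr)
    print(padding)
```

Whether the failure still appears with the padded run section is what the maintainers are asking to be reported.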

egafni commented 3 weeks ago

I'm almost always launching from Ubuntu, so I don't think this is the issue.


Michaelvll commented 3 weeks ago

Hi @egafni, it would be great if you could share the entire log for the error above, so we can check whether there is an error message earlier in the output. Thanks!
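
Since the failure is intermittent, one way to make sure the entire log is available the next time it happens is to launch through a small wrapper that saves everything the command prints. The following is a sketch, not part of SkyPilot: the helper `run_and_capture` and the output file name `sky_launch_full.log` are hypothetical, and the command is the one from the original report.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_and_capture(cmd: Sequence[str], log_file: Path) -> int:
    """Run cmd, echoing combined stdout/stderr to the terminal and to log_file."""
    with log_file.open("w") as log, subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so ordering is preserved
        text=True,
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
    # Popen's context manager waits for the process, so returncode is set here.
    return proc.returncode


if __name__ == "__main__":
    # Command copied from the report above; the log file name is an assumption.
    cmd = ["sky", "launch", "-n", "job122", "--down", "-r", "-y",
           "--memory=32", "--gpus", "L4:1", "devops/skypilot/job.yaml"]
    sys.exit(run_and_capture(cmd, Path("sky_launch_full.log")))
```

The saved file can then be attached to the issue whenever a launch fails.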