Open egafni opened 1 month ago
Hi @egafni - are you running on mac? To help us debug, next time this occurs, can you try either
a) reducing the length of your run section
or
b) adding a bunch of random commands (e.g., echo <really long string>) to your run section to make it larger than 120 KB
and report whether it solves the problem?
This may be related to our optimization, which inlines the run section in a single SSH command:
https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L141-L152
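For option (b), a minimal sketch of generating the filler commands is below. The helper name `make_padding` is hypothetical, and reading "120 KB" as 120 * 1024 bytes is an assumption; the generated lines only echo a dummy string.

```python
# Hypothetical helper for producing filler `echo` lines for a run section.
# Assumption: "120 KB" is read as 120 * 1024 bytes.
TARGET_BYTES = 120 * 1024


def make_padding(target_bytes: int = TARGET_BYTES, line_len: int = 1000) -> str:
    """Return newline-terminated `echo` commands totalling more than target_bytes."""
    line = "echo " + "x" * line_len
    bytes_per_line = len(line) + 1  # +1 for the trailing newline
    # One line more than the floor division guarantees we exceed the target.
    count = target_bytes // bytes_per_line + 1
    return "\n".join([line] * count) + "\n"
```

Printing `make_padding()` and pasting the result into the run section of the task YAML (indented to match the existing commands) should push that section past the size mentioned above.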
I'm almost always launching from Ubuntu, so I don't think this is the issue.
(Sent as an email reply to Romil Bhardwaj's comment of Tue, Aug 20, 2024, 10:00 PM, quoted above.)
Hi @egafni, could you share the entire log for the error above, so we can check whether there is an error message in the earlier part? Thanks!
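One way to make sure the entire output is available the next time the failure occurs is to run the launch through a small wrapper that saves everything to a file. This is only a sketch: `run_and_capture` and the log file name are hypothetical, and it relies on nothing beyond the Python standard library.

```python
import subprocess
import sys
from pathlib import Path


def run_and_capture(cmd, log_path):
    """Run cmd, echo its combined stdout/stderr, and save a copy to log_path.

    Returns the command's exit code.
    """
    with Path(log_path).open("w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()


# Example with the command from this issue (log file name is hypothetical):
# run_and_capture(
#     ["sky", "launch", "-n", "job122", "--down", "-r", "-y", "--memory=32",
#      "--gpus", "L4:1", "devops/skypilot/job.yaml"],
#     "sky_launch_job122.log",
# )
```

The saved file can then be attached to the issue as-is.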
I am randomly getting the following error (maybe 1/10 jobs), and I do not know how to debug it. Generally, just submitting the exact same command again works.
The command is: sky launch -n job122 --down -r -y --memory=32 --gpus L4:1 devops/skypilot/job.yaml
... log_path = os.path.expanduser(os.path.join('"'"'~/sky_logs/sky-2024-08-08-06-30-57-542722/tasks'"'"', f'"'"'{rank}-{node_name}.log'"'"'))
sky_env_vars_dict['"'"'SKYPILOT_NODE_RANK'"'"'] = rank
# Backward compatibility: Environment starting with `SKY_` is
...
returncodes = get_or_fail(futures, pg)
if sum(returncodes) != 0:
    job_lib.set_status(1, job_lib.JobStatus.FAILED)
    # Schedule the next pending job immediately to make the job
else:
    job_lib.set_status(1, job_lib.JobStatus.SUCCEEDED)
    # Schedule the next pending job immediately to make the job
' > ~/.sky/sky_app/sky_job_1; } && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_owner_kwargs = {} if getattr(constants, "SKYLET_LIB_VERSION", 0) >= 1 else {"job_owner": getpass.getuser()};job_lib.scheduler.queue(1,'"'"'RAY_DASHBOARD_PORT=$($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) job submit --address=http://127.0.0.1:$RAY_DASHBOARD_PORT --submission-id 1-$(whoami) --no-wait "$([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_1 > ~/sky_logs/sky-2024-08-08-06-30-57-542722/run.log 2> /dev/null"'"'"')' failed with return code 255. Failed to submit job 1.
Version & Commit info:
sky -v: skypilot, version 1.0.0.dev20240807
sky -c: 51f1f78d8b45beaad2f89a9fb0fca2ca03350621
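
Since submitting the exact same command again generally works, a stopgap while the root cause is investigated could be a small retry wrapper around the launch. This is a sketch only: `launch_with_retry` is a hypothetical helper, it retries on any nonzero exit code (whether `sky launch` itself exits with 255 in this case is not confirmed here), and the attempt count and delay are arbitrary.

```python
import subprocess
import time


def launch_with_retry(cmd, max_attempts=3, delay_seconds=30.0):
    """Run cmd until it exits with 0 or max_attempts is reached.

    Returns (last_returncode, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return returncode, attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before resubmitting
    return returncode, max_attempts


# Example with the command from this issue:
# launch_with_retry(
#     ["sky", "launch", "-n", "job122", "--down", "-r", "-y", "--memory=32",
#      "--gpus", "L4:1", "devops/skypilot/job.yaml"]
# )
```

This only hides the flakiness, so the full log from a failed attempt is still needed to find the underlying cause.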