skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] User setup causing SkyPilot runtime to fail #4097

Open Michaelvll opened 6 days ago

Michaelvll commented 6 days ago

The following task.yaml causes job submission to fail when launched with sky launch -c test task.yaml:

Reproduction

resources:
  cloud: aws
  disk_size: 256

num_nodes: 2

setup: |
  set -ex
  echo "setup stage begin"
  pip install --no-input img2dataset

run: |
  set -ex
  echo "run stage begin"

Reason

After logging into the cluster, the issue appears to be that installing img2dataset changed the numpy/pyarrow versions in the base Python environment, which somehow breaks the skypilot-runtime installed in a separate Python venv.

ssh test
source ~/skypilot-runtime/bin/activate
ray job list
 JobDetails(type=<JobType.SUBMISSION: 'SUBMISSION'>, job_id=None, submission_id='5-ubuntu', driver_info=None, status=<JobStatus.FAILED: 'FAILED'>, entrypoint='/home/ubuntu/skypilot-runtime/bin/python -u ~/.sky/sky_app/sky_job_5 > ~/sky_logs/sky-2024-10-16-22-16-21-150087/run.log 2> /dev/null', message='Unexpected error occurred: The actor died because of an error raised in its creation task, \x1b[36mray::_ray_internal_job_actor_5-ubuntu:JobSupervisor.__init__()\x1b[39m (pid=4560, ip=172.31.83.142, actor_id=b812234ae2b45cf7b4ee51d501000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7b86d8213250>)\n  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 451, in result\n    return self.__get_result()\n  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result\n    raise self._exception\n  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/util/serialization_addons.py", line 39, in apply\n    _register_custom_datasets_serializers(serialization_context)\n  File "/opt/conda/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>\n    import pyarrow.lib as _lib\n  File "pyarrow/lib.pyx", line 36, in init pyarrow.lib\nImportError: numpy.core.multiarray failed to import', error_type=None, start_time=1729117024917, end_time=1729117026116, metadata={}, runtime_env={}, driver_agent_http_address=None, driver_node_id=None, driver_exit_code=None)]
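A quick way to confirm the leakage (hypothetical diagnostic, not part of the SkyPilot CLI): from inside the skypilot-runtime venv, check which files numpy and pyarrow actually resolve to. With --system-site-packages they can come from the base /opt/conda env rather than the venv itself.

```shell
# Activate the runtime venv, then print where each module resolves from.
# Paths under /opt/conda instead of ~/skypilot-runtime indicate base-env leakage.
source ~/skypilot-runtime/bin/activate
python -c 'import importlib.util
for mod in ("numpy", "pyarrow"):
    spec = importlib.util.find_spec(mod)
    print(mod, "->", spec.origin if spec else "not found")'
```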

Potential fixes

We may need to be careful with the --system-site-packages option used when creating the venv in our skypilot-runtime setup, as packages changed in the base env can affect the SkyPilot runtime as well.

https://github.com/skypilot-org/skypilot/blob/53380e26f01452559012d57b333b17f40dd8a4d1/sky/skylet/constants.py#L158

Tested with that argument removed from the skypilot-runtime setup, and the problem goes away. We should avoid this argument in our hosted image (cc'ing @yika-luo) and see if we should get rid of it for custom images as well (this may cause much longer provisioning time, since more packages have to be installed instead of reusing the system's existing ones).
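For reference, the isolation difference can be sketched as follows (the /tmp paths are hypothetical, not SkyPilot's actual setup commands):

```shell
# Create one venv of each kind.
python3 -m venv --system-site-packages /tmp/rt_shared   # sees the base env's site-packages
python3 -m venv /tmp/rt_isolated                        # isolated from the base env

# The choice is recorded in pyvenv.cfg. When it is true, the venv's sys.path
# also includes the base interpreter's site-packages, so a pip install in the
# base env (e.g. one that bumps numpy/pyarrow) changes what the venv imports.
grep include-system-site-packages /tmp/rt_shared/pyvenv.cfg
grep include-system-site-packages /tmp/rt_isolated/pyvenv.cfg
```

Note that packages installed directly into the venv still shadow system ones; the failure mode here is for packages the venv never installed itself and silently inherits from the base env.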

yika-luo commented 6 days ago

Testing the impact on provisioning time now

Michaelvll commented 6 days ago

> Testing the impact on provisioning time now

I suppose we should avoid this argument in our hosted image creation, i.e. the Packer file. In that case, it should not affect provisioning time?

yika-luo commented 13 hours ago

The latest custom images don't use --system-site-packages. Also tested the example yaml and it works fine.