skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.44k stars 459 forks

[Serve] GCP credential path error with docker image and replica #3325

Open sean-styleai opened 5 months ago

sean-styleai commented 5 months ago

Can you help me launch sky serve autoscaling with a docker image?

The launch command:

sky serve up -n {service name} --env-file {env file path} service.yaml

The service.yaml:

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 4
    target_qps_per_replica: 3
    upscale_delay_seconds: 180
    downscale_delay_seconds: 900

# Fields below describe each replica.
resources:
  cloud: GCP
  ports: 8000
  accelerators: L4

workdir: .

setup: docker login -u ${DOCKER_ID} -p ${DOCKER_PW} {docker image repository}

run: docker run -v ~/models/:/usr/app/models -p 8000:8000 -e ENV=prod  --runtime=nvidia --gpus all {docker image path}
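For context, the `--env-file` passed to `sky serve up` is assumed to define the two variables the YAML interpolates. A minimal sketch (variable names taken from the YAML above; values are placeholders):

```shell
# Hypothetical contents of the --env-file. Names match the ${DOCKER_ID} and
# ${DOCKER_PW} references in the YAML above; the values are placeholders.
DOCKER_ID=mydockerid
DOCKER_PW='s3cret'   # single quotes keep special characters literal
```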

An error occurs when the replica is provisioned; it looks like a missing GCP credential error.

I 03-18 05:34:02 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-a20c2158' returned non-zero exit status 1.)
I 03-18 05:34:02 replica_managers.py:121]   Traceback: Traceback (most recent call last):
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-18 05:34:02 replica_managers.py:121]     sky.launch(task,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-18 05:34:02 replica_managers.py:121]     return _execute(
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-18 05:34:02 replica_managers.py:121]     backend.sync_file_mounts(handle, task.file_mounts,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     self._execute_file_mounts(handle, all_file_mounts)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     if storage.is_directory(src):
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-18 05:34:02 replica_managers.py:121]     p = subprocess.run(command,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-18 05:34:02 replica_managers.py:121]     raise CalledProcessError(retcode, process.args,
I 03-18 05:34:02 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-a20c2158' returned non-zero exit status 1.
concretevitamin commented 5 months ago

cc @cblmemo

cblmemo commented 5 months ago

Hi @sean-styleai ! Thanks for reporting the issue. Could you try to directly sky launch this YAML to see if the error persists? Also, could you share the output of sky status on your local laptop (for more information on the SkyServe controller spec)?

sean-styleai commented 5 months ago

Thank you for the fast response! @concretevitamin @cblmemo The same error seems to persist.

Last few lines of sky launch:

I 03-19 16:20:42 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 16:21:31 provisioner.py:553] Successfully provisioned cluster: sky-878e-namsangho
⠸ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 16:21:47 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 16:21:47 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-16-10-03-637914/workdir_sync.log
I 03-19 16:22:19 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
Clusters
NAME                LAUNCHED     RESOURCES                                               STATUS  AUTOSTOP  COMMAND
sky-878e-namsangho  40 secs ago  1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000'])  UP      -         sky launch -n studio_api ...

sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-03-19-16-10-03-637914 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-03-19-16-10-03-637914/setup-34.73.11.91.log

****** START Last lines of setup output ******
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
******* END Last lines of setup output *******
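As an aside, the `bash: !dwk: event not found` line above suggests the password contains a `!` that history expansion mangled in the interactive setup shell (`bash -i`). A minimal sketch of keeping such a password literal (placeholder value; `docker login --password-stdin` would then consume it instead of `-p`):

```shell
# Placeholder password containing '!'. In an interactive bash, an unquoted or
# double-quoted '!' triggers history expansion (hence "!dwk: event not found");
# single quotes keep it literal.
DOCKER_PW='s3cret!dwk'
printf '%s\n' "$DOCKER_PW"
```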

Output of sky status:

❯ sky status
Clusters
NAME                LAUNCHED    RESOURCES                                               STATUS  AUTOSTOP  COMMAND
sky-878e-namsangho  2 mins ago  1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000'])  UP      -         sky launch -n studio_api ...

Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)

Services
No live services. (See: sky serve -h)
cblmemo commented 5 months ago

(quoting the previous comment)

Hmm, seems like the password is not correct? Can you successfully run the setup & run commands on your local laptop?

sean-styleai commented 5 months ago

@cblmemo Oh, I'm so sorry. The above result seems to be caused by incorrect docker auth. I will recheck and update the comments. Thank you!

sean-styleai commented 5 months ago

@cblmemo sky launch works fine!

A few lines of sky launch output:

I 03-19 17:19:23 provisioner.py:76] Launching on GCP us-east4 (us-east4-a)
I 03-19 17:22:10 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 17:23:00 provisioner.py:553] Successfully provisioned cluster: sky-d8d7-namsangho
⠹ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 17:23:16 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 17:23:16 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-17-19-14-933366/workdir_sync.log
I 03-19 17:23:43 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/gcpuser/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
I 03-19 17:23:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-19 17:24:01 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1
I 03-19 08:24:04 log_lib.py:392] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.0.11']

Output of sky status:

❯ sky status
Clusters
NAME                           LAUNCHED     RESOURCES                                                    STATUS  AUTOSTOP  COMMAND
sky-d8d7-namsangho             16 mins ago  1x GCP(g2-standard-4, {'L4': 1}, ports=['8000'])             UP      -         sky launch --env-file /Us...
sean-styleai commented 5 months ago

@cblmemo

I retried sky serve up with the same settings.

The controller seems to work fine, but the service replicas hit the same problem!

Result of sky serve logs studio_api 1 (replica creation keeps repeating):

I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:00 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:53 cloud_vm_ray_backend.py:4266] Processing file mounts.

I 03-19 08:42:55 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.)
I 03-19 08:42:55 replica_managers.py:121]   Traceback: Traceback (most recent call last):
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-19 08:42:55 replica_managers.py:121]     sky.launch(task,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-19 08:42:55 replica_managers.py:121]     return _execute(
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-19 08:42:55 replica_managers.py:121]     backend.sync_file_mounts(handle, task.file_mounts,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     self._execute_file_mounts(handle, all_file_mounts)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     if storage.is_directory(src):
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-19 08:42:55 replica_managers.py:121]     p = subprocess.run(command,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-19 08:42:55 replica_managers.py:121]     raise CalledProcessError(retcode, process.args,
I 03-19 08:42:55 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.
cblmemo commented 5 months ago

(quoting the previous comment)

Thanks for reporting this! Could you share the output of sky -v and sky -c as well?

cblmemo commented 5 months ago

@sean-styleai Also, could you share the current output of sky status that contains the controller information as well?

sean-styleai commented 5 months ago

@cblmemo Here it is. Thank you for your fast response!

❯ sky -v
skypilot, version 1.0.0.dev20240317
❯ sky -c
skypilot, commit 823999af850ee93138f45d01abba6c54a93d3c1e

Output of sky status:

❯ sky status
Clusters
NAME                           LAUNCHED    RESOURCES                                                    STATUS  AUTOSTOP  COMMAND
sky-serve-controller-b61da251  2 mins ago  1x GCP(n2-standard-4, disk_size=200, ports=['30001-30100'])  UP      10m       sky serve up -n studio_api...

Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)

Services
NAME        VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
studio_api  -        -       NO_REPLICA  0/1       34.172.38.176:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED  RESOURCES  STATUS        REGION
studio_api    1   1        -   -         -          PROVISIONING  -

* To see detailed service status: sky serve status -a
* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
cblmemo commented 5 months ago

(quoting the previous comment)

Hmm, it seems I cannot reproduce this error at the same commit. Could you SSH to the controller, run the following command, and share the output with me?

pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38
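Before running the full command, a quicker hypothetical check is whether the ADC file that the command hard-codes via `GOOGLE_APPLICATION_CREDENTIALS` even exists on the controller:

```shell
# Hypothetical sanity check: the failing command points at this ADC path,
# so first confirm the file was synced to the controller at all.
ADC="$HOME/.config/gcloud/application_default_credentials.json"
if [ -f "$ADC" ]; then
  msg="ADC present: $ADC"
else
  msg="ADC missing: $ADC"
fi
echo "$msg"
```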
sean-styleai commented 5 months ago

@cblmemo The output of the above command is below:

Your "OAuth 2.0 Service Account" credentials are invalid. Please run
  $ gcloud auth login
OSError: No such file or directory.

After manually running gcloud auth login, I get output like below:

gs://skypilot-workdir-namsangho-7141c640
gs://skypilot-workdir-namsangho-7141c640/.dockerignore
gs://skypilot-workdir-namsangho-7141c640/.gitignore
gs://skypilot-workdir-namsangho-7141c640/README.md
gs://skypilot-workdir-namsangho-7141c640/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-api.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-7141c640/requirements.txt
gs://skypilot-workdir-namsangho-7141c640/assets/
gs://skypilot-workdir-namsangho-7141c640/dockerfiles/
gs://skypilot-workdir-namsangho-7141c640/infra/
gs://skypilot-workdir-namsangho-7141c640/notebooks/
gs://skypilot-workdir-namsangho-7141c640/scripts/
gs://skypilot-workdir-namsangho-7141c640/src/
cblmemo commented 5 months ago

(quoting the previous comment)

Could you run the command on your local laptop again? If it also fails, that might be the reason...

sean-styleai commented 5 months ago

@cblmemo It works fine locally!

gs://skypilot-workdir-namsangho-89bfeef2/.dockerignore
gs://skypilot-workdir-namsangho-89bfeef2/.gitignore
gs://skypilot-workdir-namsangho-89bfeef2/README.md
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements.txt
gs://skypilot-workdir-namsangho-89bfeef2/assets/
gs://skypilot-workdir-namsangho-89bfeef2/dockerfiles/
gs://skypilot-workdir-namsangho-89bfeef2/infra/
gs://skypilot-workdir-namsangho-89bfeef2/notebooks/
gs://skypilot-workdir-namsangho-89bfeef2/scripts/
gs://skypilot-workdir-namsangho-89bfeef2/src/
sean-styleai commented 5 months ago

@cblmemo Is this issue related to how the GCP service account data is transmitted from the controller to the service replicas?

cblmemo commented 5 months ago

@cblmemo Is this issue related with transmission of gcp sa data from controller to service replica?

Sorry for the late reply; I was a bit busy recently. Given that you cannot access your GCS storage on the controller, this looks more like the SA credentials were not correctly synced from the local laptop to the controller. cc @Michaelvll for a look here 👀 are the SA credentials included in the following directory?

https://github.com/skypilot-org/skypilot/blob/acb49ee82e26f5a700747bb101d389cdf19a4d35/sky/clouds/gcp.py#L39-L41

GrelaM100 commented 4 months ago

Hi @sean-styleai, I experienced a similar issue with the managed spot jobs controller. What worked for me was deleting the gcloud directory inside the .config directory. After that, I ran the sky spot launch command again and everything worked as expected. This might be a workaround worth trying.

martin-liu commented 4 months ago

I experienced the same issue. Basically, gsutil fails with `Your "OAuth 2.0 Service Account" credentials are invalid` in the controller VM. I SSHed to the VM and tried several things:

Not an expert on this, any thoughts?

martin-liu commented 4 months ago

Tried to debug it in the controller VM. gsutil -D ls gives some more detail; I found that the credential gs_service_key_file in ~/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/.boto points to my local laptop path (/Users/xxx/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/adc.json) rather than /home/gcpuser/.config/...

So basically it copied all the credentials to the remote machine without ensuring the paths inside them are corrected for the remote server.
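A hypothetical one-off repair based on that finding (the local prefix `/Users/xxx` is a stand-in, and this demonstrates on a throwaway file rather than the real `.boto`): rewrite the stale laptop prefix so `gs_service_key_file` points under the remote home directory.

```shell
# Stand-in for ~/.config/gcloud/legacy_credentials/<sa>/.boto on the controller.
BOTO=$(mktemp)
printf 'gs_service_key_file = /Users/xxx/.config/gcloud/legacy_credentials/sa/adc.json\n' > "$BOTO"
# Replace the local-laptop prefix with the remote home directory.
sed -i "s|/Users/[^/]*|$HOME|" "$BOTO"
cat "$BOTO"
```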