Open. sean-styleai opened this issue 5 months ago.
cc @cblmemo
Hi @sean-styleai! Thanks for reporting the issue. Could you try to directly sky launch this YAML and see if the error persists? Also, could you share the output of sky status on your local laptop (for more information on the SkyServe controller spec)?
Thank you for the fast response! @concretevitamin @cblmemo The same error still persists.
Last few lines of sky launch:
I 03-19 16:20:42 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 16:21:31 provisioner.py:553] Successfully provisioned cluster: sky-878e-namsangho
⠸ Launching - Opening new ports
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 16:21:47 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 16:21:47 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-16-10-03-637914/workdir_sync.log
I 03-19 16:22:19 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-878e-namsangho 40 secs ago 1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000']) UP - sky launch -n studio_api ...
sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-03-19-16-10-03-637914 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-03-19-16-10-03-637914/setup-34.73.11.91.log
****** START Last lines of setup output ******
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
******* END Last lines of setup output *******
Output of sky status:
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-878e-namsangho 2 mins ago 1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000']) UP - sky launch -n studio_api ...
Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)
Services
No live services. (See: sky serve -h)
Hmm, it seems the password is incorrect? Can you successfully run the setup & run commands on your local laptop?
@cblmemo Oh, I'm so sorry. The above result seems to be caused by incorrect Docker auth. I will recheck and update my comments. Thank you!
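For reference, a minimal sketch of a setup step that avoids both failure modes in the log above: the bash "event not found" error (history expansion on a ! in the password) and the --password CLI warning. The DOCKER_USER/DOCKER_PASSWORD names are assumptions, e.g. values passed via the task's envs: section, not something SkyPilot defines:

```shell
# Illustrative setup snippet (variable names are assumptions):
# log in to Docker Hub without putting the password on the command line.
DOCKER_USER="${DOCKER_USER:-example-user}"
# A '!' in the value is safe here: it never hits interactive history
# expansion or `ps` output, unlike `docker login --password 'p!ss...'`.
DOCKER_PASSWORD="${DOCKER_PASSWORD:-p!ssw0rd}"

# --password-stdin reads the secret from stdin instead of argv.
if command -v docker >/dev/null 2>&1; then
  printf '%s' "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
fi
```

printf '%s' is used instead of echo so no trailing newline is appended to the password.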
@cblmemo sky launch works fine!
Last few lines of sky launch output:
I 03-19 17:19:23 provisioner.py:76] Launching on GCP us-east4 (us-east4-a)
I 03-19 17:22:10 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 17:23:00 provisioner.py:553] Successfully provisioned cluster: sky-d8d7-namsangho
⠹ Launching - Opening new ports
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 17:23:16 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 17:23:16 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-17-19-14-933366/workdir_sync.log
I 03-19 17:23:43 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/gcpuser/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
I 03-19 17:23:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-19 17:24:01 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1
I 03-19 08:24:04 log_lib.py:392] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.0.11']
Output of sky status:
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-d8d7-namsangho 16 mins ago 1x GCP(g2-standard-4, {'L4': 1}, ports=['8000']) UP - sky launch --env-file /Us...
@cblmemo Same settings, retried sky serve up. The controller seems to work fine, but the service replicas hit the same problem!
Result of sky serve logs studio_api 1 (replica creation keeps repeating):
I 03-19 08:42:00 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:53 cloud_vm_ray_backend.py:4266] Processing file mounts.
I 03-19 08:42:55 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.)
I 03-19 08:42:55 replica_managers.py:121] Traceback: Traceback (most recent call last):
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-19 08:42:55 replica_managers.py:121] sky.launch(task,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-19 08:42:55 replica_managers.py:121] return _execute(
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-19 08:42:55 replica_managers.py:121] backend.sync_file_mounts(handle, task.file_mounts,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121] return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121] self._execute_file_mounts(handle, all_file_mounts)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-19 08:42:55 replica_managers.py:121] if storage.is_directory(src):
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-19 08:42:55 replica_managers.py:121] p = subprocess.run(command,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-19 08:42:55 replica_managers.py:121] raise CalledProcessError(retcode, process.args,
I 03-19 08:42:55 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.
Thanks for reporting this! Could you share the output of sky -v and sky -c as well?
@sean-styleai Also, could you share the current output of sky status, including the controller information?
@cblmemo Here it is. Thank you for your fast response!
❯ sky -v
skypilot, version 1.0.0.dev20240317
❯ sky -c
skypilot, commit 823999af850ee93138f45d01abba6c54a93d3c1e
Output of sky status:
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-serve-controller-b61da251 2 mins ago 1x GCP(n2-standard-4, disk_size=200, ports=['30001-30100']) UP 10m sky serve up -n studio_api...
Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
studio_api - - NO_REPLICA 0/1 34.172.38.176:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
studio_api 1 1 - - - PROVISIONING -
* To see detailed service status: sky serve status -a
* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
Hmm, it seems I cannot reproduce this error at the same commit. Could you ssh into the controller, run the following command, and share the output with me?
pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38
@cblmemo The output of the above command is as follows:
Your "OAuth 2.0 Service Account" credentials are invalid. Please run
$ gcloud auth login
OSError: No such file or directory.
After manually running gcloud auth login, I get the following output:
gs://skypilot-workdir-namsangho-7141c640
gs://skypilot-workdir-namsangho-7141c640/.dockerignore
gs://skypilot-workdir-namsangho-7141c640/.gitignore
gs://skypilot-workdir-namsangho-7141c640/README.md
gs://skypilot-workdir-namsangho-7141c640/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-api.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-7141c640/requirements.txt
gs://skypilot-workdir-namsangho-7141c640/assets/
gs://skypilot-workdir-namsangho-7141c640/dockerfiles/
gs://skypilot-workdir-namsangho-7141c640/infra/
gs://skypilot-workdir-namsangho-7141c640/notebooks/
gs://skypilot-workdir-namsangho-7141c640/scripts/
gs://skypilot-workdir-namsangho-7141c640/src/
Could you run the command on your local laptop again? If it also fails, that might be the reason...
@cblmemo It works fine locally!
gs://skypilot-workdir-namsangho-89bfeef2/.dockerignore
gs://skypilot-workdir-namsangho-89bfeef2/.gitignore
gs://skypilot-workdir-namsangho-89bfeef2/README.md
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements.txt
gs://skypilot-workdir-namsangho-89bfeef2/assets/
gs://skypilot-workdir-namsangho-89bfeef2/dockerfiles/
gs://skypilot-workdir-namsangho-89bfeef2/infra/
gs://skypilot-workdir-namsangho-89bfeef2/notebooks/
gs://skypilot-workdir-namsangho-89bfeef2/scripts/
gs://skypilot-workdir-namsangho-89bfeef2/src/
@cblmemo Is this issue related to the transmission of GCP service account (SA) data from the controller to the service replicas?
Sorry for the late reply; I was a little busy recently. Given that you cannot access your GCS storage on the controller, it seems more likely that the SA credentials were not correctly synced from your local laptop to the controller. cc @Michaelvll for a look here 👀 Are the SA credentials included in the following directory?
Hi @sean-styleai, I experienced a similar issue with the managed spot jobs controller. What worked for me was deleting the gcloud directory inside the .config directory. After that, I executed the sky spot launch command again, and everything worked as expected. This might be a workaround worth trying.
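The workaround above, expressed as commands (a sketch; since this discards cached gcloud credentials, renaming the directory is safer than deleting it outright, and the CONF_DIR variable is only for illustration):

```shell
# Move (rather than delete) the possibly stale gcloud config aside.
CONF_DIR="${CONF_DIR:-$HOME/.config/gcloud}"
if [ -d "$CONF_DIR" ]; then
  mv "$CONF_DIR" "${CONF_DIR}.bak"
fi
# Then re-run the launch, e.g.: sky spot launch task.yaml
# (task.yaml is a placeholder for your own task file.)
```

If the relaunch still fails, the backup directory can simply be moved back.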
I experienced the same issue. Basically, gsutil fails with Your "OAuth 2.0 Service Account" credentials are invalid on the controller VM.
I SSHed to the VM and tried several things:
1. gcloud storage ls works, but gsutil ls fails.
2. Reinstalled the SDK from https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-474.0.0-linux-x86_64.tar.gz, but gsutil ls still fails.
3. Ran apt update then apt upgrade google-cloud-sdk; after that, gsutil ls works.
Not an expert on this, any thoughts?
Tried to debug it on the controller VM. gsutil -D ls gives some more detail: I found that the gs_service_key_file credential entry in ~/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/.boto points to my local laptop path (/Users/xxx/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/adc.json) rather than /home/gcpuser/.config/...
So basically all the credentials were copied to the remote machine without ensuring the paths were corrected for the remote server.
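If that diagnosis is right, one possible stopgap on the controller (a sketch, assuming only the path prefix in the synced .boto file is wrong; the service-account directory name below is a placeholder) is to rewrite the stale laptop path in place:

```shell
# Placeholder path; substitute your actual service-account directory.
BOTO_FILE="$HOME/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/.boto"

if [ -f "$BOTO_FILE" ]; then
  # Replace any /Users/<name>/.config/gcloud prefix (a macOS laptop home)
  # with the controller's own home directory, keeping a .bak backup.
  sed -i.bak -E "s|/Users/[^/]+/\.config/gcloud|$HOME/.config/gcloud|g" "$BOTO_FILE"
fi
```

After the rewrite, gsutil ls should pick up the key file from the controller's home directory; the real fix would of course be for the sync step to rewrite these paths itself.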
Can you help me launch sky serve autoscaling with a Docker image?
Launch command like below:
service.yaml like below:
An error occurs once the replica is provisioned, possibly because the GCP credentials do not exist.