spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

Unable to create GCP CPU instances #106

Open khumairraj opened 2 years ago

khumairraj commented 2 years ago

Hi there! Thank you for the amazing tool.

I have been trying to use spotty to create a CPU instance on GCP. Below is the spotty.yaml file I am using.

project:
  name: spotty-heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - embeddings/*
        - .mypy_cache/*
        - lightning_logs/*
        - heareval.egg-info/*
        - pretrained/*
        - wandb/*

containers:
  - projectDir: /workspace/project
    image: alpine
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: spotty-heareval-dp-khumairraj
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: n1-standard-1
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cpu-v20210818-debian-10
      volumes:
        - name: workspace
          parameters:
            size: 250
            mountDir: /workspace

The error that comes up is:

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-spotty-heareval-spotty-heareval-dp-khumairraj" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"412","ResourceErrorMessage":"Failure condition satisfied."}

Please let me know if I am missing something in the configuration, or if there is a known solution. Thanks!

turian commented 2 years ago

@apls777 I am having the same issue with this spotty.yaml:

# You must delete disks manually on GCP :\
# https://console.cloud.google.com/compute/disks?project=hear2021-evaluation

project:
  name: hearpreprocess
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - .mypy_cache/*
        - hearpreprocess.egg-info/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: hearpreprocess-i1-joseph
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: c2-standard-16
      preemptibleInstance: False
      # gcloud compute images list
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      imageUri: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210825
      volumes:
        - name: workspace
          parameters:
            # Be careful to delete this if you're not using it!
            size: 2000
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
            mountDir: /workspace

scripts:
  clean: |
    bash clean.sh

turian commented 2 years ago

@apls777 let me know if you have any ideas about this! Thank you

apls777 commented 2 years ago

@khumairraj @turian Sorry for the delay in getting back to you. The issue in both of your configs is the mountDir: /workspace parameter. Just remove it and it will work.

Usually, there is no need to specify the instances[]...volumes[]...mountDir parameter. It customizes where the disk is mounted on the host OS. By default, Spotty mounts the disk somewhere under the /mnt/... directory, and that directory is then mounted inside your container at /workspace, as specified by the containers[].volumeMounts[].mountPath parameter. It's a bug, though, because it should work even when a custom mountDir is specified, so I'll leave this issue open until it's fixed.
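
Conceptually, the result is just a Docker bind mount from the generated host directory into the container, roughly like this (the paths and image are placeholders for illustration, not actual Spotty internals):

# what Spotty effectively does when starting the container:
# bind the host mount dir to the container's mountPath
docker run -v /mnt/<project>-<instance>-workspace:/workspace ... <image>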

@khumairraj you also have another issue in your config. Spotty expects bash to be installed inside the Docker image; if it isn't, you won't be able to connect to the container. So don't use the raw alpine image; instead, create a custom Dockerfile that inherits from alpine and installs bash on top.
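
For example, a minimal sketch of such an image (the alpine tag and the image name are illustrative, not from this thread):

# write a Dockerfile that inherits alpine and installs bash on top
cat > Dockerfile <<'EOF'
FROM alpine:3.14
RUN apk add --no-cache bash
EOF
# build and push it, then use it as containers[].image instead of alpine
docker build -t <your-dockerhub-user>/alpine-bash .
docker push <your-dockerhub-user>/alpine-bash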

turian commented 2 years ago

@apls777 I tried this, but it doesn't work yet. I'm running spotty from master.

Here is my latest spotty.yaml; it's the same as above, except that I removed mountDir:

# You must delete disks manually on GCP :\
# https://console.cloud.google.com/compute/disks?project=hear2021-evaluation

project:
  name: hearpreprocess
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - .mypy_cache/*
        - hearpreprocess.egg-info/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '32G']

instances:
  - name: hearpreprocess-cpu-joseph
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: c2-standard-16
      preemptibleInstance: False
      # gcloud compute images list
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      imageUri: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210825
      volumes:
        - name: workspace
          parameters:
            # Be careful to delete this if you're not using it!
            size: 2000
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
#            mountDir: /workspace

scripts:
  clean: |
    bash clean.sh

After spotty sh, it says that docker is not found.

I run spotty start -C and I get:

CommandException: arg (/mnt/hearpreprocess-hearpreprocess-cpu-joseph-workspace/project) does not name a directory, bucket, or bucket subdir.
If there is an object with the same path, please add a trailing
slash to specify the directory.
Connection to 34.91.169.30 closed.
Error:
------
Failed to download files from the bucket to the instance

Why? Note that I changed the instance name to make sure I have a fresh disk, as discussed in #108.

khumairraj commented 2 years ago

project:
  name: hearprep
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - .mypy_cache/*
        - hearpreprocess.egg-info/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '32G']

instances:
  - name: hearprep-cpui2-delkhumair
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: c2-standard-16
      preemptibleInstance: False
      imageUri: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210825
      volumes:
        - name: workspace
          parameters:
            size: 1000

I also tried the above config and got the error below:

Creating disks...
  - disk "hearprep-hearprep-cpui2-delkhumair-workspace" was created

Preparing the deployment template...
  - image URL: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210825
  - zone: europe-west4-a
  - on-demand VM
  - no GPUs

Volumes:
+-----------+------------+------+-----------------+
| Name      | Mount Path | Type | Deletion Policy |
+===========+============+======+=================+
| workspace | /workspace | Disk | Retain Volume   |
+-----------+------------+------+-----------------+

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-hearprep-hearprep-cpui2-delkhumair" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"412","ResourceErrorMessage":"Failure condition satisfied."}

Is there something I am missing? Thanks again for all your help!

apls777 commented 2 years ago

@khumairraj Most likely, it's an issue with the GCP image. Try updating it to the latest one; see my reply here.
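
For reference, the available deep learning VM images can be listed with gcloud (deeplearning-platform-release is the public project that hosts them; the filter below is just an example for the CPU image family):

gcloud compute images list \
    --project deeplearning-platform-release \
    --filter="family ~ common-cpu" \
    --sort-by=~creationTimestamp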

turian commented 2 years ago

@apls777 I upgraded to the latest Ubuntu image, but unlike the deep learning images, it doesn't come with docker by default:

# You must delete disks manually on GCP :\
# https://console.cloud.google.com/compute/disks?project=hear2021-evaluation

project:
  name: hearpreprocess
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - .mypy_cache/*
        - hearpreprocess.egg-info/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '32G']

instances:
  - name: hearpreprocess-cpu-joseph
    provider: gcp
    parameters:
      zone: europe-west4-a
      #machineType: c2-standard-16
      #machineType: e2-standard-32
      machineType: c2-standard-60
      preemptibleInstance: False
      # gcloud compute images list
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      imageUri: projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20211118
      volumes:
        - name: workspace
          parameters:
            size: 5000

scripts:
  clean: |
    bash clean.sh

Then I run spotty sh -H and execute cat /var/log/startup-script.log:

bash: docker: command not found
Container is not running.
Use the "spotty start -C" command to start it.

If instead I switch the image to the latest CPU deeplearning image:

      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cpu-v20211118-debian-10

I get the following weird error in /var/log/startup-script.log:

0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.
+ echo 'bind-key x kill-pane'
++ dirname /tmp/spotty/instance/scripts/container_bash.sh
+ mkdir -p /tmp/spotty/instance/scripts
+ cat
+ chmod +x /tmp/spotty/instance/scripts/container_bash.sh
+ CONTAINER_BASH_ALIAS=container
+ echo 'alias container="/tmp/spotty/instance/scripts/container_bash.sh"'
+ echo 'alias container="/tmp/spotty/instance/scripts/container_bash.sh"'
+ mkdir -pm 777 /tmp/spotty
+ mkdir -pm 777 /tmp/spotty/containers
+ /tmp/spotty/instance/scripts/startup/02_mount_volumes.sh
+ DEVICE_NAMES=("disk-1")
+ MOUNT_DIRS=("/mnt/hearpreprocess-hearpreprocess-cpu-joseph-workspace")
+ for i in ${!DEVICE_NAMES[*]}
+ DEVICE=/dev/disk/by-id/google-disk-1
+ MOUNT_DIR=/mnt/hearpreprocess-hearpreprocess-cpu-joseph-workspace
+ blkid -o value -s TYPE /dev/disk/by-id/google-disk-1
+ mkfs -t ext4 /dev/disk/by-id/google-disk-1
mke2fs 1.44.5 (15-Dec-2018)
/dev/disk/by-id/google-disk-1 is apparently in use by the system; will not make a filesystem here!
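
That mke2fs message suggests the device was already mounted (or otherwise busy) when the startup script tried to format it. Something like the following, run on the host, can show what is holding it (generic diagnostic commands, not part of the startup script):

# show filesystem type and any mountpoints for the disk
lsblk -f /dev/disk/by-id/google-disk-1
# list active mounts of the device, if any
findmnt /dev/disk/by-id/google-disk-1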

And spotty start -C gives:

CommandException: arg (/mnt/hearpreprocess-hearpreprocess-cpu-joseph-workspace/project) does not name a directory, bucket, or bucket subdir.
If there is an object with the same path, please add a trailing
slash to specify the directory.