spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

GCP example missing. #102

Open mcartagenah opened 3 years ago

mcartagenah commented 3 years ago

Hi, I couldn't find a GCP example besides the one in #68, but when I try it I get the following error:

Preparing the deployment template...
  Error:
  ------
  <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/ml-images/global/images/family/common-gce-gpu-image?alt=json returned "The resource 'projects/ml-images/global/images/family/common-gce-gpu-image' was not found". Details: "[{'message': "The resource 'projects/ml-images/global/images/family/common-gce-gpu-image' was not found", 'domain': 'global', 'reason': 'notFound'}]">

What am I doing wrong?

turian commented 3 years ago

@apls777 I have the same issue and it's urgent :(

This is my spotty.yaml, could you show a simple working GCP spotty yaml?

project:
  name: spotty-heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir
        - embeddings

containers:
  - projectDir: /workspace/project
    file: Dockerfile
#    ports:
#      # TensorBoard
#      - containerPort: 6006
#        hostPort: 6006
#      # Jupyter
#      - containerPort: 8888
#        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      zone: europe-west4-a
      # A100 TODO: Try others
      machineType: a2-highgpu-1g
#      spotInstance: True
#      ports: [6006, 8888]
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 250
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            size: 20
            mountDir: /docker

scripts:
  setup: |
    bash setup.sh
  train: |
    bash train.sh
#  tensorboard: |
#    tensorboard --bind_all --port 6006 --logdir /workspace/project/logs
#  jupyter: |
#    jupyter notebook --allow-root --ip 0.0.0.0 --notebook-dir=/workspace/project

apls777 commented 3 years ago

@turian I'll have a look at it later today. I think GCP just renamed their GPU images.

turian commented 3 years ago

thank you!

apls777 commented 3 years ago

@turian @mcartagenah I’ll update the code later, but for now, you can just add this line to the instance parameters:

imageUri: projects/ml-images/global/images/family/common-dl-gpu-debian-10

JFYI: when I was working with GCP, I couldn’t use preemptible (spot) GPU instances as they were immediately shut down after launch. But I had a good experience using on-demand CPU instances with preemptible TPUs. If you still want to give it a try, use the preemptibleInstance: true parameter instead of spotInstance: true.
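
For reference, here is a sketch of how the two suggestions above might fit into the instances section of the spotty.yaml posted earlier in this thread. Only imageUri and preemptibleInstance are new; everything else is copied from that config. The placement directly under parameters is an assumption based on "add this line to the instance parameters".

```yaml
instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: a2-highgpu-1g
      # Workaround from this thread: point Spotty at the renamed image family.
      imageUri: projects/ml-images/global/images/family/common-dl-gpu-debian-10
      # Optional: on GCP, use this instead of spotInstance.
      # Left commented out because of the shutdown issue described above.
      # preemptibleInstance: true
      dockerDataRoot: /docker
      # volumes: unchanged from the config posted above
```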

Also, keep in mind that I tested GCP a lot less than AWS, and it looks like not many people are actually using it, so you might find some bugs.

turian commented 3 years ago

@apls777 great! I am happy to file bugs and share things that are successful to help other spotty users. We have a grant from GCP to run leaderboard evaluations for our NeurIPS competition: https://neuralaudio.ai/hear2021-holistic-evaluation-of-audio-representations.html

I would be very excited to use spotty because it will radically simplify the evaluation workflow.

Regarding preemptible instances, "Note: If you are requesting a Preemptible GPU quota for NVIDIA® V100® GPUs, in the justification for the request, specify that the request is for preemptible GPUs." (https://cloud.google.com/compute/docs/gpus)

A few more questions trying to get GCP GPUs running through spotty:

0) How do I figure out what version of CUDA is running? From the full image URL? "projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10"
1) common-dl-gpu-debian-10 where did you find this documented? I googled this but can't find it documented. Is there a way to pick Ubuntu? Is there a way to change CUDA version? I would love to look these questions up. [edit: Ah weird, I find it here: https://console.cloud.google.com/compute/images?project=hear2021-evaluation]
2) Is it imageUri now? The docs (https://spotty.cloud/docs/providers/gcp/instance-parameters.html) say imageUrl.

apls777 commented 3 years ago

How do I figure out what version of CUDA is running? From the full image URL?

You can find the image in the GCP console by its name and check the CUDA version in the description: https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10. It's using CUDA 11.0, so if you need to know it from the image URL, I guess it's the cu110 part.

common-dl-gpu-debian-10 where did you find this documented? ... [edit: Ah weird, I find it here: https://console.cloud.google.com/compute/images?project=hear2021-evaluation]

Yes, you can find it in the list of available images in the GCP console. But common-dl-gpu-debian-10 is the image family, not the image itself, so you need to look at the "Family" column. If you're not familiar with this concept: you can use a "family" image URL instead of a direct image URL to make sure you're always running the latest version of an image. At the moment, the latest version is c0-deeplearning-common-cu110-v20210818-debian-10.
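
To make the family/image distinction concrete, here is a sketch of the two forms the imageUri value could take, using only names mentioned in this thread. That Spotty accepts a direct image path in imageUri, and not only a family path, is an assumption here and is not confirmed above. Other instance parameters are omitted.

```yaml
instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      # Family path: resolves to the latest image in the family.
      imageUri: projects/ml-images/global/images/family/common-dl-gpu-debian-10
      # Direct image path (assumed to be accepted): pins one specific build,
      # here the CUDA 11.0 image named above. Use one form or the other.
      # imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
```

Pinning a direct path trades automatic updates for reproducibility, which may matter if a job depends on a particular CUDA version.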

Is there a way to pick Ubuntu? Is there a way to change CUDA version? I would love to look these questions up.

GCP doesn't support Ubuntu-based images with pre-installed Docker and CUDA, but you can always create your own Ubuntu image with any CUDA version and use it with Spotty through the imageUri parameter.

Is it imageUri now? The docs say imageUrl.

Yeah, I noticed it, it's a typo in the docs. Will fix it later.

turian commented 3 years ago

@mcartagenah can we close this issue?

turian commented 3 years ago

@apls777 just curious if you know how to use CUDA 11.1 with GCP images: https://console.cloud.google.com/compute/images?project=hear2021-evaluation

They all appear to be cu110. But PyTorch 1.9.0 builds are only against CUDA 11.1; CUDA 11.0 is supported only through PyTorch 1.7.1.

turian commented 3 years ago

Related https://github.com/spotty-cloud/spotty/issues/104

apls777 commented 3 years ago

Replied in #104

mcartagenah commented 3 years ago

@mcartagenah can we close this issue?

Yes, now it's working with the imageUri you pointed out.

Thank you :)

turian commented 3 years ago

@apls777 confirming that if you ask Google for preemptible GPUs in your quota requests, they work with spotty.