skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Storage: download big datasets using cheaper VMs vs. using GPU VMs? #696

Open Michaelvll opened 2 years ago

Michaelvll commented 2 years ago

A user would like to download a large dataset to disk using a cheaper CPU instance, then switch to a GPU instance attached to the same disk so she can train on the downloaded dataset.
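
To make the motivation concrete, here is a rough back-of-the-envelope sketch of what the download costs on each kind of VM. The dataset size, bandwidth, and hourly prices are placeholder assumptions, not quoted cloud prices, and the download is assumed to be network-bound, so it takes the same time on either machine.

```python
def download_hours(dataset_gb: float, bandwidth_gbit_s: float) -> float:
    """Hours needed to pull `dataset_gb` gigabytes at `bandwidth_gbit_s` Gbit/s."""
    seconds = dataset_gb * 8 / bandwidth_gbit_s  # GB -> Gbit, then divide by rate
    return seconds / 3600


def download_cost(dataset_gb: float, bandwidth_gbit_s: float,
                  price_per_hour: float) -> float:
    """Cost in USD of keeping a VM up for the whole download."""
    return download_hours(dataset_gb, bandwidth_gbit_s) * price_per_hour


# Placeholder numbers (assumptions, not real price quotes).
DATASET_GB = 1000.0
BANDWIDTH_GBIT_S = 1.0
CPU_PRICE = 0.40   # USD/hour, hypothetical CPU VM
GPU_PRICE = 24.00  # USD/hour, hypothetical 8-GPU VM

cpu_cost = download_cost(DATASET_GB, BANDWIDTH_GBIT_S, CPU_PRICE)
gpu_cost = download_cost(DATASET_GB, BANDWIDTH_GBIT_S, GPU_PRICE)
print(f"CPU VM: ${cpu_cost:.2f}, GPU VM: ${gpu_cost:.2f}")
```

Under these assumptions the download time is identical, so the cost ratio is simply the ratio of the two hourly prices.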

Michaelvll commented 2 years ago

One idea for the API: use a chain DAG and specify that data is transferred between chain nodes via EBS.

chain_task:
  __data_transfer: ebs
  download_task:
    run: wget …
  train_task:
    resources: V100:8
    setup: …
    run: …
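
One possible reading of the chain spec above, written as a small Python model. This is only a sketch: `Stage`, `Chain`, and `plan` are hypothetical names, not part of the SkyPilot API, and the attach/detach/terminate steps are an assumption about how an EBS handoff between stages might work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stage:
    name: str
    run: str
    accelerators: Optional[str] = None  # e.g. "V100:8"; None means CPU-only


@dataclass
class Chain:
    data_transfer: str  # e.g. "ebs": how data moves between stages
    stages: List[Stage] = field(default_factory=list)


def plan(chain: Chain) -> List[str]:
    """Expand a chain into an ordered list of human-readable steps."""
    steps = []
    for i, stage in enumerate(chain.stages):
        hw = stage.accelerators or "cpu"
        steps.append(f"provision {hw} VM for {stage.name}")
        steps.append(f"attach {chain.data_transfer} volume")
        steps.append(f"run {stage.name}")
        if i < len(chain.stages) - 1:
            # Hand the volume to the next stage and stop paying for this VM.
            steps.append(f"detach {chain.data_transfer} volume")
            steps.append(f"terminate VM for {stage.name}")
    return steps


chain = Chain(
    data_transfer="ebs",
    stages=[
        Stage(name="download_task", run="wget ..."),  # URL elided in the original
        Stage(name="train_task", run="...", accelerators="V100:8"),
    ],
)
for step in plan(chain):
    print(step)
```

The point of the model is that only the volume outlives a stage; each VM exists just long enough to run its own task.
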

gmittal commented 2 years ago

Can we not use the mounted file system for this? e.g. spin up a cpunode where you begin the download with the file system mounted then launch your big GPU machine with the mounted FS also attached.

gmittal commented 2 years ago

This is actually something I do for projects/homework as well: I change the GPU instance to a larger instance type when I need more GPUs (same EBS volume), and downsize when I want to save costs but still run smaller experiments (e.g. change to a T4 instance instead of a V100). I usually do this with "change instance type" in the AWS console.

I would appreciate this feature quite a bit!
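
For reference, the manual console workflow described above could be scripted roughly as follows. This is a minimal sketch, assuming `ec2` is a boto3 EC2 client (e.g. `boto3.client("ec2")`) and that the instance is EBS-backed; the function name `change_instance_type` is made up for illustration.

```python
def change_instance_type(ec2, instance_id: str, new_type: str) -> None:
    """Stop an EBS-backed instance, change its type, and start it again.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # The EBS volumes stay attached, so the data is still there after restart.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Example usage (the instance ID is a placeholder):
#   import boto3
#   change_instance_type(boto3.client("ec2"), "i-0123456789abcdef0", "p3.2xlarge")
```

Passing the client in as an argument keeps the function easy to try against a stub before pointing it at a real account.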

concretevitamin commented 2 years ago

> Can we not use the mounted file system for this? e.g. spin up a cpunode where you begin the download with the file system mounted then launch your big GPU machine with the mounted FS also attached.

The original problem is already "S3 -> EBS". Currently, mounted storage is S3 only; we don't support mounting EBS/EFS for now. Agreed, this is something we should investigate.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

infwinston commented 1 year ago

This issue is worth keeping and could be resolved with our Python interface.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 10 days with no activity.

strict-type commented 1 month ago

@Michaelvll Is this on the roadmap? It would be very helpful for development and for saving costs.