skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Bug][UX] Meaning of `DEVICE_MEM` for multi-GPU instance type is not aligned in `sky show-gpus` #3434

Open · cblmemo opened this issue 3 months ago

cblmemo commented 3 months ago

In the current master (d0f20abaa58d6da3876c58363fb1390c5d32a7a2), the meaning of `DEVICE_MEM` in `sky show-gpus` is not consistent across clouds. For example, on AWS it represents the total device memory across all GPUs on the instance, while on GCP it represents the device memory of a single GPU.

(skyserve) ➜  skypilot git:(master) ✗ sky -c          
skypilot, commit d0f20abaa58d6da3876c58363fb1390c5d32a7a2
(skyserve) ➜  skypilot git:(master) ✗ sky show-gpus L4
GPU  QTY  CLOUD   INSTANCE_TYPE   DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION     
L4   1    AWS     g6.16xlarge     22GB        64     256GB     $ 3.397       $ 0.345            us-west-2  
L4   1    AWS     g6.2xlarge      22GB        8      32GB      $ 0.978       $ 0.100            us-east-2  
L4   1    AWS     g6.4xlarge      22GB        16     64GB      $ 1.323       $ 0.135            us-east-2  
L4   1    AWS     g6.8xlarge      22GB        32     128GB     $ 2.014       $ 0.205            us-east-2  
L4   1    AWS     g6.xlarge       22GB        4      16GB      $ 0.805       $ 0.081            us-east-1  
L4   1    AWS     gr6.4xlarge     22GB        16     128GB     $ 1.539       $ 0.156            us-east-2  
L4   1    AWS     gr6.8xlarge     22GB        32     256GB     $ 2.446       $ 0.248            us-east-1  
L4   4    AWS     g6.12xlarge     89GB        48     192GB     $ 4.602       $ 0.468            us-east-1  
L4   4    AWS     g6.24xlarge     89GB        96     384GB     $ 6.675       $ 0.671            us-west-2  
L4   8    AWS     g6.48xlarge     179GB       192    768GB     $ 13.350      $ 1.359            us-west-2  
L4   1    GCP     g2-standard-4   24GB        4      16GB      $ 0.705       $ 0.248            us-east4   
L4   2    GCP     g2-standard-24  24GB        24     96GB      $ 1.994       $ 0.730            us-east4   
L4   4    GCP     g2-standard-48  24GB        48     192GB     $ 3.989       $ 1.459            us-east4   
L4   8    GCP     g2-standard-96  24GB        96     384GB     $ 7.977       $ 2.918            us-east4   
L4   1    RunPod  1x_L4_SECURE    -           4      24GB      $ 0.440       -                  CA         
L4   2    RunPod  2x_L4_SECURE    -           8      48GB      $ 0.880       -                  CA         
L4   4    RunPod  4x_L4_SECURE    -           16     96GB      $ 1.760       -                  CA         
L4   8    RunPod  8x_L4_SECURE    -           32     192GB     $ 3.520       -                  CA         

GPU  QTY  CLOUD       INSTANCE_TYPE              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION            
L40  1    RunPod      1x_L40_SECURE              -           16     48GB      $ 1.140       $ 0.690            CA                
L40  2    RunPod      2x_L40_SECURE              -           32     96GB      $ 2.280       $ 1.380            CA                
L40  4    RunPod      4x_L40_SECURE              -           64     192GB     $ 4.560       $ 2.760            CA                
L40  8    RunPod      8x_L40_SECURE              -           128    384GB     $ 9.120       $ 5.520            CA                
L40  1    Fluidstack  recVcAEL8UwVgZWP5WNrQJN8r  48GB        32     60GB      $ 1.761       -                  generic_1_canada  
L40  2    Fluidstack  recSCaaKigbSg5MQPPVNoH9nG  48GB        64     120GB     $ 3.508       -                  generic_1_canada  
L40  4    Fluidstack  reciAySCoQSubyQ2atsxqFRxK  48GB        126    240GB     $ 7.002       -                  generic_1_canada  
L40  8    Fluidstack  recBNHGKPfgVSjm7hhThid8wu  48GB        252    480GB     $ 14.001      -                  generic_1_canada  
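
To make the inconsistency concrete, here is a minimal sketch (plain Python, with values copied from the output above) that divides the reported `DEVICE_MEM` by the GPU count. If the column were per-GPU memory, the reported value would stay constant as QTY grows (which is what GCP shows); if it were the total across GPUs, the reported value divided by QTY would stay roughly constant (which is what AWS shows).

# Rows copied from the `sky show-gpus L4` output above: (cloud, qty, reported DEVICE_MEM in GB).
rows = [
    ("AWS", 1, 22), ("AWS", 4, 89), ("AWS", 8, 179),
    ("GCP", 1, 24), ("GCP", 2, 24), ("GCP", 4, 24), ("GCP", 8, 24),
]

for cloud, qty, mem_gb in rows:
    # AWS: reported / qty stays ~22 GB, so the column is the total across GPUs.
    # GCP: reported stays 24 GB regardless of qty, so the column is per GPU.
    print(f"{cloud}: qty={qty} reported={mem_gb}GB reported/qty={mem_gb / qty:.1f}GB")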
concretevitamin commented 3 months ago

Good catch @cblmemo! The intention is for this to be per-device memory. We should correct those values, and/or rename the field to prevent this kind of misunderstanding in the future.
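
If the fix goes the "correct those values" route, one possible direction is to normalize to per-device memory wherever a cloud's catalog records memory summed across all GPUs. A rough sketch of the conversion (not SkyPilot's actual catalog code; the helper name and the assumption that the AWS data stores the summed figure are mine):

def per_device_mem_gb(total_mem_gb: float, accelerator_count: int) -> float:
    """Convert a memory figure summed over all GPUs to per-device memory.

    Hypothetical helper: assumes the upstream catalog entry (e.g. from the AWS
    fetcher) reports memory totaled across the instance's GPUs, as the
    g6.12xlarge/g6.48xlarge rows above suggest.
    """
    if accelerator_count <= 0:
        raise ValueError("accelerator_count must be positive")
    return total_mem_gb / accelerator_count

# g6.48xlarge: 179 GB reported for 8x L4 -> ~22.4 GB per GPU, in line with GCP's ~24 GB figure.
print(per_device_mem_gb(179, 8))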