ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[perf-tests] make the bucket name and prefix configurable for ray data image resize job #2156

Closed andrewsykim closed 1 month ago

andrewsykim commented 1 month ago

Why are these changes needed?

This PR makes the bucket name and bucket prefix configurable in the ray-data-image-resize sample job.

I also refactored ray_data_image_resize.py, since I ran into a bunch of issues with ray.data.read_images.
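
For readers who want a picture of what "configurable" could look like, here is a minimal sketch, not the code from this PR. It assumes the job reads its settings from environment variables; the names BUCKET_NAME and BUCKET_PREFIX, the BucketConfig class, and the load_bucket_config helper are all hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BucketConfig:
    """Where the job should look for input images (hypothetical type)."""
    bucket: str
    prefix: str


def load_bucket_config(env: Optional[Mapping[str, str]] = None) -> BucketConfig:
    """Build a BucketConfig from environment-style key/value pairs.

    BUCKET_NAME and BUCKET_PREFIX are assumed variable names, not
    necessarily the ones used by the actual sample job.
    """
    env = os.environ if env is None else env

    # The bucket has no sensible default, so fail early if it is missing.
    bucket = env.get("BUCKET_NAME", "").strip()
    if not bucket:
        raise ValueError("BUCKET_NAME must be set to a non-empty value")

    # An empty prefix means "the whole bucket". Object names do not start
    # with a slash, so drop any leading one to keep prefix matching simple.
    prefix = env.get("BUCKET_PREFIX", "").strip().lstrip("/")

    return BucketConfig(bucket=bucket, prefix=prefix)
```

With something like this in place, the same script can be pointed at a different dataset by changing the environment of the job's container, without editing the Python file.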

Related issue number

https://github.com/ray-project/kuberay/issues/2069

Checks

Logs from my test using a larger dataset:

$ kubectl logs -f image-resize-10-nodes-4wmng                                                                                                                                                                                                                                                             
2024-05-17 10:57:45,258 INFO cli.py:36 -- Job submission server address: http://image-resize-10-nodes-raycluster-t2sm7-head-svc.default.svc.cluster.local:8265                                                                                                                                                                                                             
2024-05-17 10:57:46,336 SUCC cli.py:60 -- --------------------------------------------------------                                                                                                                                                                                                                                                                         
2024-05-17 10:57:46,336 SUCC cli.py:61 -- Job 'image-resize-10-nodes-tmxgz' submitted successfully                                                                                                                                                                                                                                                                         
2024-05-17 10:57:46,336 SUCC cli.py:62 -- --------------------------------------------------------                                                                                                                                                                                                                                                                         
2024-05-17 10:57:46,336 INFO cli.py:285 -- Next steps                                                                                                                                                                                                                                                                                                                      
2024-05-17 10:57:46,336 INFO cli.py:286 -- Query the logs of the job:                                                                                                                                                                                                                                                                                                      
2024-05-17 10:57:46,336 INFO cli.py:288 -- ray job logs image-resize-10-nodes-tmxgz                                                                                                                                                                                                                                                                                        
2024-05-17 10:57:46,336 INFO cli.py:290 -- Query the status of the job:                                                                                                                                                                                                                                                                                                    
2024-05-17 10:57:46,336 INFO cli.py:292 -- ray job status image-resize-10-nodes-tmxgz                                                                                                                                                                                                                                                                                      
2024-05-17 10:57:46,336 INFO cli.py:294 -- Request the job to be stopped:                                                                                                                                                                                                                                                                                                  
2024-05-17 10:57:46,337 INFO cli.py:296 -- ray job stop image-resize-10-nodes-tmxgz                                                                                                                                                                                                                                                                                        
2024-05-17 10:57:46,349 INFO cli.py:303 -- Tailing logs until the job exits (disable with --no-wait):                                                                                                                                                                                                                                                                      
2024-05-17 10:57:52,968 INFO worker.py:1405 -- Using address 10.12.18.6:6379 set in the environment variable RAY_ADDRESS                                                                                                                                                                                                                                                   
2024-05-17 10:57:52,968 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.12.18.6:6379...                                                                                                                                                                                                                                                           
2024-05-17 10:57:52,983 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.12.18.6:8265                                                                                                                                                                                                                                                      
2024-05-17 10:57:56,042 INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(download_blob)->Map(transform_frame)]                                                                                                                                                                                                            
2024-05-17 10:57:56,043 INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)           
2024-05-17 10:57:56,043 INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`                                                                                                                                                                                         

Running 0:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                                                                                         
Running: 0.0/176.0 CPU, 0.0/0.0 GPU, 0.0 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                         
Running: 50.0/176.0 CPU, 0.0/0.0 GPU, 0.06 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                       
Running: 100.0/176.0 CPU, 0.0/0.0 GPU, 0.11 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                      
Running: 150.0/176.0 CPU, 0.0/0.0 GPU, 0.17 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                      
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:00<?, ?it/s]                                                                                                                                                                                                                                                       
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   0%|          | 0/352 [00:38<?, ?it/s]                                                                                                                                                                                                                                                      
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   0%|          | 1/352 [00:38<3:46:22, 38.70s/it]                                                                                                                                                                                                                                            
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   0%|          | 1/352 [00:38<3:46:22, 38.70s/it]                                                                                                                                                                                                                                             
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   0%|          | 1/352 [00:40<3:46:22, 38.70s/it]                                                                                                                                                                                                                                            
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 2/352 [00:40<1:40:05, 17.16s/it]                                                                                                                                                                                                                                            
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   1%|          | 2/352 [00:40<1:40:05, 17.16s/it]                                                                                                                                                                                                                                             
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 2/352 [00:41<1:40:05, 17.16s/it]                                                                                                                                                                                                                                            
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 3/352 [00:41<55:21,  9.52s/it]                                                                                                                                                                                                                                              
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   1%|          | 3/352 [00:41<55:21,  9.52s/it]                                                                                                                                                                                                                                               
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 3/352 [00:41<55:21,  9.52s/it]                                                                                                                                                                                                                                              
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 4/352 [00:41<34:22,  5.93s/it]                                                                                                                                                                                                                                              
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   1%|          | 4/352 [00:41<34:22,  5.93s/it]                                                                                                                                                                                                                                               
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|          | 4/352 [00:42<34:22,  5.93s/it]                                                                                                                                                                                                                                              
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory:   1%|▏         | 5/352 [00:42<22:47,  3.94s/it]                                                                                                                                                                                                                                              
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   1%|▏         | 5/352 [00:42<22:47,  3.94s/it]                                                                                                                                                                                                                                               
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   2%|▏         | 6/352 [00:42<15:36,  2.71s/it]                                                                                                                                                                                                                                               
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory:   2%|▏         | 6/352 [00:42<15:36,  2.71s/it]