Closed. andrewsykim closed this pull request 1 month ago.
Why are these changes needed?

This PR makes the bucket and bucket prefix configurable in the ray-data-image-resize sample job.

I also refactored ray_data_image_resize.py because I ran into several issues with ray.data.read_images (see the linked issue).
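As suggested by the `Map(download_blob)->Map(transform_frame)` operators visible in the logs below, the refactor replaces `ray.data.read_images` with explicit map tasks over a list of object paths. A minimal sketch of that shape is shown here, with the bucket and prefix taken from environment variables. The variable names, defaults, and the stubbed-out `fetch` callable are all illustrative, not the PR's actual code; the real job would call a cloud storage client and an image library inside these functions.

```python
import os

# Illustrative env-var names; the actual sample job defines its own.
BUCKET = os.environ.get("BUCKET_NAME", "ray-images")
PREFIX = os.environ.get("BUCKET_PREFIX", "images/")


def download_blob(row, fetch=lambda bucket, path: b"raw-bytes"):
    # In the real job this would call a storage client (e.g. download the
    # object from the bucket); a stub `fetch` keeps the sketch self-contained.
    row["bytes"] = fetch(BUCKET, row["path"])
    return row


def transform_frame(row, size=(128, 128)):
    # In the real job this would decode and resize the image bytes;
    # here we only record the target size.
    row["resized_to"] = size
    return row


def run(paths):
    # The real job maps these functions over a Ray Dataset, roughly:
    #   ds = ray.data.from_items(rows).map(download_blob).map(transform_frame)
    rows = [{"path": f"{PREFIX}{p}"} for p in paths]
    return [transform_frame(download_blob(r)) for r in rows]
```

Driving the per-row functions through `Dataset.map` rather than `read_images` gives the job full control over how each object is fetched and decoded, which is what makes the bucket and prefix straightforward to parameterize.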
Related issue number

https://github.com/ray-project/kuberay/issues/2069

Logs from my test using a larger dataset:
```
$ kubectl logs -f image-resize-10-nodes-4wmng
2024-05-17 10:57:45,258 INFO cli.py:36 -- Job submission server address: http://image-resize-10-nodes-raycluster-t2sm7-head-svc.default.svc.cluster.local:8265
2024-05-17 10:57:46,336 SUCC cli.py:60 -- --------------------------------------------------------
2024-05-17 10:57:46,336 SUCC cli.py:61 -- Job 'image-resize-10-nodes-tmxgz' submitted successfully
2024-05-17 10:57:46,336 SUCC cli.py:62 -- --------------------------------------------------------
2024-05-17 10:57:46,336 INFO cli.py:285 -- Next steps
2024-05-17 10:57:46,336 INFO cli.py:286 -- Query the logs of the job:
2024-05-17 10:57:46,336 INFO cli.py:288 -- ray job logs image-resize-10-nodes-tmxgz
2024-05-17 10:57:46,336 INFO cli.py:290 -- Query the status of the job:
2024-05-17 10:57:46,336 INFO cli.py:292 -- ray job status image-resize-10-nodes-tmxgz
2024-05-17 10:57:46,336 INFO cli.py:294 -- Request the job to be stopped:
2024-05-17 10:57:46,337 INFO cli.py:296 -- ray job stop image-resize-10-nodes-tmxgz
2024-05-17 10:57:46,349 INFO cli.py:303 -- Tailing logs until the job exits (disable with --no-wait):
2024-05-17 10:57:52,968 INFO worker.py:1405 -- Using address 10.12.18.6:6379 set in the environment variable RAY_ADDRESS
2024-05-17 10:57:52,968 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.12.18.6:6379...
2024-05-17 10:57:52,983 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.12.18.6:8265
2024-05-17 10:57:56,042 INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(download_blob)->Map(transform_frame)]
2024-05-17 10:57:56,043 INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-05-17 10:57:56,043 INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
Running 0: 0%| | 0/352 [00:00<?, ?it/s]
Running: 0.0/176.0 CPU, 0.0/0.0 GPU, 0.0 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:00<?, ?it/s]
Running: 50.0/176.0 CPU, 0.0/0.0 GPU, 0.06 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:00<?, ?it/s]
Running: 100.0/176.0 CPU, 0.0/0.0 GPU, 0.11 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:00<?, ?it/s]
Running: 150.0/176.0 CPU, 0.0/0.0 GPU, 0.17 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:00<?, ?it/s]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:00<?, ?it/s]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 0%| | 0/352 [00:38<?, ?it/s]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 0%| | 1/352 [00:38<3:46:22, 38.70s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 0%| | 1/352 [00:38<3:46:22, 38.70s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 0%| | 1/352 [00:40<3:46:22, 38.70s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 2/352 [00:40<1:40:05, 17.16s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 1%| | 2/352 [00:40<1:40:05, 17.16s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 2/352 [00:41<1:40:05, 17.16s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 3/352 [00:41<55:21, 9.52s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 1%| | 3/352 [00:41<55:21, 9.52s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 3/352 [00:41<55:21, 9.52s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 4/352 [00:41<34:22, 5.93s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 1%| | 4/352 [00:41<34:22, 5.93s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%| | 4/352 [00:42<34:22, 5.93s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 2.95 MiB/26.3 GiB object_store_memory: 1%|▏ | 5/352 [00:42<22:47, 3.94s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 1%|▏ | 5/352 [00:42<22:47, 3.94s/it]
Running: 176.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 2%|▏ | 6/352 [00:42<15:36, 2.71s/it]
Running: 175.0/176.0 CPU, 0.0/0.0 GPU, 0.2 MiB/26.3 GiB object_store_memory: 2%|▏ | 6/352 [00:42<15:36, 2.71s/it]
```