Closed zshareef closed 9 months ago
@matthewdeng Do you know what the most up-to-date version of this example is? Maybe we could point the user to it.
Hi @matthewdeng
Thank you for the reply and the reference.
Unfortunately, this example is also out of date.
For example, this code has
from ray.train import ScalingConfig
although ScalingConfig is now part of ray.air.config rather than ray.train.
After adjusting these things, I am getting the error:
AttributeError: module 'ray.train' has no attribute 'get_context'
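The import-location mismatch described above can be worked around with a fallback import. The following is only a minimal sketch: the helper name import_first is hypothetical, and the two module paths are simply the ones mentioned in this thread, so which one resolves depends on the installed Ray version.

```python
import importlib


def import_first(candidates):
    """Return the first attribute found among (module_name, attr_name) pairs.

    Returns None when no candidate can be imported.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module is missing in this installation
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# The two locations mentioned in this thread; either may be absent,
# depending on the installed Ray version.
ScalingConfig = import_first([
    ("ray.train", "ScalingConfig"),
    ("ray.air.config", "ScalingConfig"),
])

if ScalingConfig is None:
    print("ScalingConfig not found; is Ray installed?")
```

This only papers over the import path. It does not make newer APIs such as ray.train.get_context available on an older installation.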
Ah yes this is for Ray 2.7, you can try out these new APIs with https://pypi.org/project/ray/2.7.0rc0/
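To check whether an environment is new enough for the APIs mentioned here, one could compare the installed version against 2.7. This is a sketch under assumptions: the helper names are hypothetical, and the 2.7 threshold is taken from the comment above rather than from the Ray changelog.

```python
import re
from importlib import metadata


def major_minor(version):
    """Extract (major, minor) from a version string such as '2.7.0rc0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_new_train_api(version):
    """True if the version is 2.7 or later (threshold assumed from this thread)."""
    return major_minor(version) >= (2, 7)


try:
    installed = metadata.version("ray")
except metadata.PackageNotFoundError:
    installed = None  # Ray is not installed in this environment

if installed is None:
    print("Ray is not installed")
elif has_new_train_api(installed):
    print(f"Ray {installed}: ray.train.get_context should be available")
else:
    print(f"Ray {installed}: older than 2.7, expect the AttributeError above")
```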
Hey @zshareef, following up on the comment from @matthewdeng above: we recently reworked our Train API layer as part of the Ray 2.7 release (which is now out!).
Can you please follow the aforementioned example from Matt and see if you're able to get it to run successfully?
No response - closing. Please re-open if necessary.
What happened + What you expected to happen
I am trying to replicate the example given on the following page: https://docs.ray.io/en/latest/ray-air/examples/torch_image_example.html
I get an error when I start the training using TorchTrainer. The error is given below:
(TorchTrainer pid=12446) The preprocessor arg to Trainer is deprecated. Apply preprocessor transformations ahead of time by calling preprocessor.transform(ds). Support for the preprocessor arg will be dropped in a future release.
(TorchTrainer pid=12446) Starting distributed worker processes: ['12451 (127.0.0.1)']
(RayTrainWorker pid=12451) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=12451) Moving model to device: cpu
(RayTrainWorker pid=12451) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(TorchVisionPreprocessor._transform_numpy)] -> AllToAllOperator[RandomizeBlockOrder]
(RayTrainWorker pid=12451) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(RayTrainWorker pid=12451) Tip: For detailed progress reporting, run ray.data.DataContext.get_current().execution_options.verbose_progress = True
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) Task failed with retryable exception: TaskID(3db2c514f2277a03ffffffffffffffffffffffff01000000).
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) Traceback (most recent call last):
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "python/ray/_raylet.pyx", line 1191, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "python/ray/_raylet.pyx", line 3684, in ray._raylet.CoreWorker.store_task_outputs
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) for b_out in fn(iter(blocks), ctx):
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) yield from process_next_batch(batch)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) batch = batch_fn(batch, *fn_args, **fn_kwargs)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) data_batch[output_col] = transform_batch(data_batch[input_col])
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) ~~^^^^^^^^^^^
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) KeyError: 'image'
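The KeyError: 'image' above is raised where the preprocessor looks up its input column in the batch, which suggests that the batches reaching it have no column named image. The sketch below reproduces that failure mode with a plain dict standing in for a batch. It is not Ray's implementation: the function apply_to_column and the column name img are hypothetical, chosen only for illustration.

```python
def apply_to_column(batch, input_col, output_col, fn):
    """Apply fn to each value of batch[input_col] and store it under output_col.

    Raises a KeyError that lists the available columns when input_col is missing,
    which is easier to debug than a bare KeyError: 'image'.
    """
    if input_col not in batch:
        raise KeyError(
            f"column {input_col!r} not in batch; available columns: {sorted(batch)}"
        )
    result = dict(batch)  # shallow copy so the caller's batch is left untouched
    result[output_col] = [fn(value) for value in batch[input_col]]
    return result


# Hypothetical batch whose image data lives under "img", not "image".
batch = {"img": [1, 2, 3], "label": [0, 1, 0]}

message = ""
try:
    apply_to_column(batch, "image", "image", lambda x: x * 2)
except KeyError as exc:
    message = str(exc)  # names the missing column and the ones that exist

# Pointing the transform at the column that actually exists succeeds.
fixed = apply_to_column(batch, "img", "image", lambda x: x * 2)
```

When debugging the real pipeline, printing the dataset's column names before the preprocessor runs gives the same information as the improved error message here.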
2023-09-07 13:17:33,527 ERROR tune_controller.py:911 -- Trial task failed for trial TorchTrainer_1efaa_00000
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
    ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/worker.py", line 2524, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::_Inner.train() (pid=12446, ip=127.0.0.1, actor_id=139b82138f91e1569ff37d8c01000000, repr=TorchTrainer)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
    ^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::_RayTrainWorker__execute.get_next() (pid=12451, ip=127.0.0.1, actor_id=406365c73073c30c6f91a95d01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x117751750>)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/zsharee/RD/Ray/Torch_Image_Classifier.py", line 62, in train_loop_per_worker
    for i, batch in enumerate(train_dataset_batches):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 366, in iter_torch_batches
    yield from self.iter_batches(
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 159, in iter_batches
    block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 32, in _to_block_iterator
    block_iterator, stats, executor = ds._plan.execute_to_iterator()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/plan.py", line 538, in execute_to_iterator
    block_iter = itertools.chain([next(gen)], gen)
    ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_block_iterator
    for bundle in bundle_iter:
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces.py", line 548, in __next__
    return self.get_next()
    ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    op.notify_work_completed(ref)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/task_pool_map_operator.py", line 65, in notify_work_completed
    task.output = self._map_ref_to_ref_bundle(ref)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 357, in _map_ref_to_ref_bundle
    all_refs = list(ray.get(ref))
    ^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::MapBatches(TorchVisionPreprocessor._transform_numpy)() (pid=12459, ip=127.0.0.1)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
    yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
    yield from process_next_batch(batch)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch
    batch = batch_fn(batch, *fn_args, **fn_kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy
    data_batch[output_col] = transform_batch(data_batch[input_col])