ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[train/data] Problem by running TorchTrainer from ray.train.torch #39380

Closed zshareef closed 9 months ago

zshareef commented 1 year ago

### What happened + What you expected to happen

I am trying to replicate the example given on the following page: https://docs.ray.io/en/latest/ray-air/examples/torch_image_example.html

I get an error when I start training with TorchTrainer. The error is given below:

(TorchTrainer pid=12446) The preprocessor arg to Trainer is deprecated. Apply preprocessor transformations ahead of time by calling preprocessor.transform(ds). Support for the preprocessor arg will be dropped in a future release.
(TorchTrainer pid=12446) Starting distributed worker processes: ['12451 (127.0.0.1)']
(RayTrainWorker pid=12451) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=12451) Moving model to device: cpu
(RayTrainWorker pid=12451) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(TorchVisionPreprocessor._transform_numpy)] -> AllToAllOperator[RandomizeBlockOrder]
(RayTrainWorker pid=12451) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(RayTrainWorker pid=12451) Tip: For detailed progress reporting, run ray.data.DataContext.get_current().execution_options.verbose_progress = True
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) Task failed with retryable exception: TaskID(3db2c514f2277a03ffffffffffffffffffffffff01000000).
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) Traceback (most recent call last):
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "python/ray/_raylet.pyx", line 1191, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "python/ray/_raylet.pyx", line 3684, in ray._raylet.CoreWorker.store_task_outputs
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)     for b_out in fn(iter(blocks), ctx):
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)     yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)     yield from process_next_batch(batch)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)     batch = batch_fn(batch, *fn_args, **fn_kwargs)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)     data_batch[output_col] = transform_batch(data_batch[input_col])
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457)                                              ~~~~~~~~~~^^^^^^^^^^^
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12457) KeyError: 'image'
2023-09-07 13:17:33,527 ERROR tune_controller.py:911 -- Trial task failed for trial TorchTrainer_1efaa_00000
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/_private/worker.py", line 2524, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::_Inner.train() (pid=12446, ip=127.0.0.1, actor_id=139b82138f91e1569ff37d8c01000000, repr=TorchTrainer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::_RayTrainWorker__execute.get_next() (pid=12451, ip=127.0.0.1, actor_id=406365c73073c30c6f91a95d01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x117751750>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/zsharee/RD/Ray/Torch_Image_Classifier.py", line 62, in train_loop_per_worker
    for i, batch in enumerate(train_dataset_batches):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 366, in iter_torch_batches
    yield from self.iter_batches(
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 159, in iter_batches
    block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 32, in _to_block_iterator
    block_iterator, stats, executor = ds._plan.execute_to_iterator()
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/plan.py", line 538, in execute_to_iterator
    block_iter = itertools.chain([next(gen)], gen)
                                  ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_block_iterator
    for bundle in bundle_iter:
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces.py", line 548, in __next__
    return self.get_next()
           ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    op.notify_work_completed(ref)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/task_pool_map_operator.py", line 65, in notify_work_completed
    task.output = self._map_ref_to_ref_bundle(ref)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 357, in _map_ref_to_ref_bundle
    all_refs = list(ray.get(ref))
                    ^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::MapBatches(TorchVisionPreprocessor._transform_numpy)() (pid=12459, ip=127.0.0.1)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
    yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
    yield from process_next_batch(batch)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch
    batch = batch_fn(batch, *fn_args, **fn_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy
    data_batch[output_col] = transform_batch(data_batch[input_col])
                                             ~~~~~~~~~~^^^^^^^^^^^
KeyError: 'image'
2023-09-07 13:17:33,531 ERROR tune.py:1144 -- Trials did not complete: [TorchTrainer_1efaa_00000]
2023-09-07 13:17:33,534 WARNING experiment_analysis.py:916 -- Failed to read the results for 1 trials:
- /Users/zsharee/ray_results/TorchTrainer_2023-09-07_13-17-24/TorchTrainer_1efaa_00000_0_2023-09-07_13-17-24
ray.exceptions.RayTaskError(KeyError): ray::_Inner.train() (pid=12446, ip=127.0.0.1, actor_id=139b82138f91e1569ff37d8c01000000, repr=TorchTrainer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::_RayTrainWorker__execute.get_next() (pid=12451, ip=127.0.0.1, actor_id=406365c73073c30c6f91a95d01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x117751750>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/zsharee/RD/Ray/Torch_Image_Classifier.py", line 62, in train_loop_per_worker
    for i, batch in enumerate(train_dataset_batches):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 366, in iter_torch_batches
    yield from self.iter_batches(
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/iterator.py", line 159, in iter_batches
    block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 32, in _to_block_iterator
    block_iterator, stats, executor = ds._plan.execute_to_iterator()
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/plan.py", line 538, in execute_to_iterator
    block_iter = itertools.chain([next(gen)], gen)
                                  ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_block_iterator
    for bundle in bundle_iter:
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces.py", line 548, in __next__
    return self.get_next()
           ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    op.notify_work_completed(ref)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/task_pool_map_operator.py", line 65, in notify_work_completed
    task.output = self._map_ref_to_ref_bundle(ref)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 357, in _map_ref_to_ref_bundle
    all_refs = list(ray.get(ref))
                    ^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(KeyError): ray::MapBatches(TorchVisionPreprocessor._transform_numpy)() (pid=12459, ip=127.0.0.1)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
    yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
    yield from process_next_batch(batch)
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch
    batch = batch_fn(batch, *fn_args, **fn_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy
    data_batch[output_col] = transform_batch(data_batch[input_col])
                                             ~~~~~~~~~~^^^^^^^^^^^
KeyError: 'image'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zsharee/RD/Ray/Torch_Image_Classifier.py", line 99, in <module>
    result = trainer.fit()
             ^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/ray/train/base_trainer.py", line 630, in fit
    raise TrainingFailedError(
ray.train.base_trainer.TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/Users/zsharee/ray_results/TorchTrainer_2023-09-07_13-17-24")`.
To start a new run that will retry on training failures, set `air.RunConfig(failure_config=air.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461) Task failed with retryable exception: TaskID(a1531e4d5691c0d7ffffffffffffffffffffffff01000000). [repeated 39x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461) Traceback (most recent call last): [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "python/ray/_raylet.pyx", line 1191, in ray._raylet.execute_dynamic_generator_and_store_task_outputs [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "python/ray/_raylet.pyx", line 3684, in ray._raylet.CoreWorker.store_task_outputs [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)     for b_out in fn(iter(blocks), ctx): [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)     yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs) [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)     yield from process_next_batch(batch) [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/_internal/planner/map_batches.py", line 79, in process_next_batch [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)     batch = batch_fn(batch, *fn_args, **fn_kwargs) [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)   File "/opt/homebrew/lib/python3.11/site-packages/ray/data/preprocessors/torch.py", line 140, in _transform_numpy [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)     data_batch[output_col] = transform_batch(data_batch[input_col]) [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461)                                              ~~~~~~~~~~^^^^^^^^^^^ [repeated 39x across cluster]
(MapBatches(TorchVisionPreprocessor._transform_numpy) pid=12461) KeyError: 'image' [repeated 39x across cluster]

### Versions / Dependencies

I am running this code on a MacBook M1. My package details are as follows:
Python 3.11.4
absl-py==1.4.0
adlfs==2023.8.0
aiobotocore==2.5.4
aiofiles==23.1.0
aiohttp==3.8.4
aiohttp-cors==0.7.0
aioitertools==0.11.0
aioprocessing==2.0.1
aiorwlock==1.3.0
aiosignal==1.3.1
analytics-python==1.2.9
anyio==3.7.1
appdirs==1.4.4
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
Arpeggio==2.0.0
array-record==0.4.0
arrow==1.2.3
asttokens==2.2.1
astunparse==1.6.3
async-lru==2.0.4
async-timeout==4.0.2
attrs==22.2.0
azure-core==1.29.1
azure-datalake-store==0.0.53
azure-identity==1.14.0
azure-storage-blob==12.17.0
Babel==2.12.1
backcall==0.2.0
backoff==2.2.1
bayesian-optimization==1.4.3
beautifulsoup4==4.12.2
binaryornot==0.4.4
black==23.7.0
bleach==6.0.0
blessed==1.20.0
blinker==1.6.2
botocore==1.31.17
cachetools==5.3.1
certifi==2023.5.7
cffi==1.15.1
chardet==5.2.0
charset-normalizer==3.2.0
clearml==1.12.2
clearml-agent==1.5.2
click==8.1.4
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.5
comm==0.1.4
contourpy==1.1.0
cookiecutter==2.3.0
croniter==1.4.1
cryptography==41.0.3
cycler==0.11.0
dataclasses-json==0.5.9
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.6
diskcache==5.6.1
distlib==0.3.7
dm-tree==0.1.8
docker==6.1.3
docker-image-py==0.1.12
docker-pycreds==0.4.0
docstring-parser==0.15
etils==1.3.0
executing==1.2.0
fastapi==0.101.0
fastjsonschema==2.18.0
filelock==3.12.0
Flask==2.3.2
Flask-Cors==4.0.0
flatbuffers==23.5.26
flyteidl==1.5.15
flytekit==1.8.3
flytekitplugins-pod==1.8.3
fonttools==4.42.0
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.6.0
furl==2.1.3
gast==0.4.0
gcsfs==2023.6.0
gitdb==4.0.10
GitPython==3.1.32
google-api-core==2.11.1
google-auth==2.22.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.3.3
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.5.0
googleapis-common-protos==1.59.1
gpustat==1.1
gql==3.4.1
graphql-core==3.2.3
grpcio==1.53.0
grpcio-status==1.53.0
h11==0.14.0
h5py==3.9.0
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.0.0
iniconfig==2.0.0
install==1.3.5
ipykernel==6.25.0
ipynbname==2023.2.0.0
ipython==8.14.0
isodate==0.6.1
isoduration==20.11.0
itsdangerous==2.1.2
janus==1.0.0
jaraco.classes==3.3.0
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.1
json5==0.9.14
jsonpointer==2.4
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
jupyter-events==0.7.0
jupyter-lsp==2.2.0
jupyter_client==8.3.0
jupyter_core==5.3.1
jupyter_server==2.7.0
jupyter_server_terminals==0.4.4
jupyterlab==4.0.4
jupyterlab-pygments==0.2.2
jupyterlab_server==2.24.0
keras==2.13.1
keyring==24.2.0
kiwisolver==1.4.4
kubernetes==27.2.0
libclang==16.0.0
lightgbm==4.0.0
lightning-utilities==0.9.0
llvmlite==0.40.1
Markdown==3.4.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.20.1
marshmallow-enum==1.5.1
marshmallow-jsonschema==0.13.0
matplotlib==3.7.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.1
more-itertools==10.1.0
mpmath==1.3.0
msal==1.23.0
msal-extensions==1.0.0
msgpack==1.0.5
multidict==6.0.4
mypy-extensions==1.0.0
natsort==8.4.0
nbclient==0.8.0
nbconvert==7.7.3
nbformat==5.9.2
nest-asyncio==1.5.7
networkx==3.1
notebook==7.0.2
notebook_shim==0.2.3
numba==0.57.1
numpy==1.24.3
nvidia-ml-py==12.535.77
oauthlib==3.2.2
objgraph==3.6.0
onnx==1.14.0
opencensus==0.11.2
opencensus-context==0.1.3
opt-einsum==3.3.0
orderedmultidict==1.0.1
overrides==7.4.0
packaging==23.1
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
parver==0.4
pathlib==1.0.1
pathlib2==2.3.7.post1
pathspec==0.11.2
pathtools==0.1.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.5.0
platformdirs==3.10.0
plotly==5.15.0
pluggy==1.0.0
portalocker==2.7.0
prometheus-client==0.17.1
promise==2.3
prompt-toolkit==3.0.39
protobuf==4.23.4
protoc-gen-swagger==0.1.0
psutil==5.9.5
ptyprocess==0.7.0
pulumi==3.73.0
pulumi-aws==5.41.0
pulumi-eks==1.0.2
pulumi-gcp==6.62.0
pulumi-kubernetes==3.30.0
pulumi-random==4.13.2
pure-eval==0.2.2
py-spy==0.3.14
pyarrow==10.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.12
Pygments==2.16.1
PyJWT==2.4.0
pynndescent==0.5.10
pyOpenSSL==23.2.0
pyparsing==3.0.9
pytest==7.3.1
python-dateutil==2.8.2
python-json-logger==2.0.7
python-slugify==8.0.1
pytimeparse==1.1.8
pytorch-lightning==2.0.5
pytz==2023.3
PyYAML==6.0.1
pyzmq==25.1.0
ray==2.6.3
referencing==0.30.2
regex==2023.8.8
requests==2.28.2
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.5.2
rich-click==1.6.1
rpds-py==0.9.2
rsa==4.9
s3fs==2023.6.0
scikit-learn==1.3.0
scipy==1.11.1
semver==2.13.0
Send2Trash==1.8.2
sentry-sdk==1.28.0
setproctitle==1.3.2
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.4.1
stack-data==0.6.2
starlette==0.27.0
statsd==3.3.0
subprocess.run==0.0.8
sympy==1.12
tenacity==8.2.2
tensorboard==2.13.0
tensorboard-data-server==0.7.1
tensorboardX==2.6.2
tensorflow==2.13.0
tensorflow-datasets==4.9.2
tensorflow-estimator==2.13.0
tensorflow-macos==2.13.0
tensorflow-metadata==1.13.1
termcolor==2.3.0
terminado==0.17.1
text-unidecode==1.3
threadpoolctl==3.1.0
tinycss2==1.2.1
toml==0.10.2
torch==2.0.1
torchmetrics==1.0.0
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
traitlets==5.9.0
tune-sklearn==0.4.6
typing-inspect==0.9.0
typing_extensions==4.5.0
tzdata==2023.3
umap-learn==0.5.3
uri-template==1.3.0
urllib3==1.26.16
uvicorn==0.23.2
virtualenv==20.21.0
wandb==0.15.8
wcwidth==0.2.6
weave==0.26.0
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.1
Werkzeug==2.3.6
wrapt==1.15.0
xgboost==1.7.6
xgboost-ray==0.1.17
yarl==1.9.2
zipp==3.16.0

### Reproduction script

```python
import ray
import torchvision
import numpy as np
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchCheckpoint
from ray.data.preprocessors import TorchVisionPreprocessor
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig
from ray.train.torch import TorchPredictor
from ray.train.batch_predictor import BatchPredictor

train_dataset = torchvision.datasets.CIFAR10("data", download=True, train=True)
test_dataset = torchvision.datasets.CIFAR10("data", download=True, train=False)

train_dataset: ray.data.Dataset = ray.data.from_torch(train_dataset)
test_dataset: ray.data.Dataset = ray.data.from_torch(test_dataset)

train_dataset

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def train_loop_per_worker(config):
    model = train.torch.prepare_model(Net())

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    train_dataset_shard = session.get_dataset_shard("train")

    for epoch in range(2):
        running_loss = 0.0
        train_dataset_batches = train_dataset_shard.iter_torch_batches(
            batch_size=config["batch_size"],
        )
        for i, batch in enumerate(train_dataset_batches):
            # get the inputs and labels
            inputs, labels = batch["image"], batch["label"]

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}")
                running_loss = 0.0

        metrics = dict(running_loss=running_loss)
        checkpoint = TorchCheckpoint.from_state_dict(model.state_dict())
        session.report(metrics, checkpoint=checkpoint)

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 2},
    datasets={"train": train_dataset},
    scaling_config=ScalingConfig(num_workers=1),
    preprocessor=preprocessor
)
result = trainer.fit()
latest_checkpoint = result.checkpoint
```
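
The traceback ends at `data_batch[input_col]` inside `TorchVisionPreprocessor._transform_numpy`, so the batch reaching the preprocessor has no `image` column. One plausible cause, stated here as an assumption rather than a confirmed diagnosis: `ray.data.from_torch` keeps each `(image, label)` sample under a single column, and the linked example appears to split that column with a `map_batches` step that this script omits. The sketch below imitates the situation with plain Python lists instead of real Ray batches; the column name `item` and the helper `split_item_column` are illustrative assumptions.

```python
def split_item_column(batch):
    """Split one "item" column of (image, label) pairs into two columns."""
    images = [image for image, _ in batch["item"]]
    labels = [label for _, label in batch["item"]]
    return {"image": images, "label": labels}


# A stand-in batch shaped like the assumed output of from_torch.
batch = {"item": [("img0", 6), ("img1", 9)]}

# Looking up "image" directly fails the same way the preprocessor does.
try:
    batch["image"]
    missing = None
except KeyError as err:
    missing = err.args[0]

# After the split, the columns the training loop expects are present.
converted = split_item_column(batch)
print(missing, sorted(converted))
```

With Ray installed, a function of this shape would be passed to `Dataset.map_batches` before the dataset is handed to the trainer; real batches hold NumPy arrays rather than lists, so the images and labels would need to be stacked into arrays.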

### Issue Severity

None

xwjiang2010 commented 1 year ago

@matthewdeng Do you know what the most up-to-date version of this example is? Maybe we could point the user to that.

matthewdeng commented 1 year ago

Maybe this one: https://docs.ray.io/en/master/train/examples/pytorch/convert_existing_pytorch_code_to_ray_train.html

zshareef commented 1 year ago

Hi @matthewdeng, thank you for the reply and the reference. Unfortunately, this example also does not match my installation. For example, the code uses `from ray.train import ScalingConfig`, but in my installation ScalingConfig is not part of ray.train; it is part of ray.air.config.

After adjusting these things, I am getting the error:

AttributeError: module 'ray.train' has no attribute 'get_context'
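
Both symptoms are consistent with the example and the installed Ray disagreeing about where names live. A minimal sketch of a version-tolerant import follows, using only the two module paths mentioned in this thread; `import_first` is a hypothetical helper, not part of Ray, and it returns `None` when nothing can be imported.

```python
import importlib


def import_first(candidates):
    """Return the first importable attribute from (module, name) pairs, else None."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module (or its parent package) is not available
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# Try the location used by the newer example first, then the older one.
ScalingConfig = import_first(
    [
        ("ray.train", "ScalingConfig"),
        ("ray.air.config", "ScalingConfig"),
    ]
)
print("ScalingConfig found:", ScalingConfig is not None)
```

This only papers over import locations. It does not supply functions such as `ray.train.get_context` that the installed release does not have.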

matthewdeng commented 1 year ago

Ah yes, this is for Ray 2.7. You can try out these new APIs with https://pypi.org/project/ray/2.7.0rc0/
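
A quick way to confirm which side of that boundary an environment is on is to read the installed version before running the example. The sketch below uses only the standard library; the helper names and the `(2, 7)` threshold are assumptions drawn from the comment above, and the package list in the report shows `ray==2.6.3`.

```python
import re
from importlib import metadata


def parse_major_minor(version_string):
    """Extract (major, minor) from strings such as "2.6.3" or "2.7.0rc0"."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def targets_new_train_api(version_string, minimum=(2, 7)):
    """True if the version is at least the release the updated example targets."""
    return parse_major_minor(version_string) >= minimum


try:
    installed = metadata.version("ray")
except metadata.PackageNotFoundError:
    installed = None  # Ray is not installed in this environment

if installed is None:
    print("ray is not installed")
elif targets_new_train_api(installed):
    print(f"ray {installed}: the Ray 2.7 example should apply")
else:
    print(f"ray {installed}: older than 2.7, so ray.train.get_context may be missing")
```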

anyscalesam commented 1 year ago

Hey @zshareef, following up on the comment from @matthewdeng above: we recently reworked our Train API layer as part of the Ray 2.7 release (which is now out!).

Can you please follow the aforementioned example from Matt and see if you're able to get it to run successfully?

anyscalesam commented 9 months ago

No response - closing. Please re-open if necessary.