ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

support pytorch lightning 1.7 #196

Open JiahaoYao opened 2 years ago

JiahaoYao commented 2 years ago
sxjscience commented 2 years ago

Checking the CI failure:

E       TypeError: __init__() got an unexpected keyword argument 'checkpoint_callback'

It might be related to these two lines:
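
For context, the checkpoint_callback argument was removed from Trainer.__init__ in PyTorch Lightning 1.7, after being deprecated since 1.5 in favor of enable_checkpointing. Below is a minimal sketch of the kind of version guard that avoids this TypeError; the call site and values are illustrative assumptions, not the lines referenced above or code from this PR.

import pytorch_lightning as pl
from packaging.version import Version

# checkpoint_callback was removed from Trainer.__init__ in PL 1.7;
# enable_checkpointing (available since PL 1.5) is the replacement.
if Version(pl.__version__) >= Version("1.7.0"):
    trainer_kwargs = {"enable_checkpointing": False}
else:
    trainer_kwargs = {"checkpoint_callback": False}

trainer = pl.Trainer(max_epochs=1, **trainer_kwargs)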

JiahaoYao commented 2 years ago

Interesting: the other tests are hanging here, but they pass on the servers.

JiahaoYao commented 2 years ago

It seems that there is not enough memory on the ubuntu-latest runner.

Does that mean the Tune tests hit an OOM issue?
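
One hedged option for the memory-constrained runner, not taken from this PR, is to start a deliberately small local Ray cluster in the Tune tests instead of letting ray.init() size the object store to the machine; the numbers below are placeholders.

import ray

# Cap the local cluster so Ray does not reserve a large object store on a
# memory-constrained ubuntu-latest runner; the sizes here are placeholders.
if not ray.is_initialized():
    ray.init(
        num_cpus=2,
        object_store_memory=200 * 1024 * 1024,  # ~200 MiB
        include_dashboard=False,
    )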

JiahaoYao commented 2 years ago
[pytest on push/test_linux_ray_master_3]   ✅  Success - Install package
[pytest on push/test_linux_ray_master_3] ⭐  Run Test with Pytest
[pytest on push/test_linux_ray_master_3]   🐳  docker exec cmd=[bash --noprofile --norc -e -o pipefail /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/workflow/4] user=
| /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/ray_lightning/tests /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
| ============================= test session starts ==============================
| platform linux -- Python 3.7.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/hostedtoolcache/Python/3.7.13/x64/bin/python
| cachedir: .pytest_cache
| rootdir: /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
| collected 7 items
|
| test_tune.py::test_tune_iteration_ddp PASSED                             [ 14%]
| test_tune.py::test_tune_iteration_horovod PASSED                         [ 28%]
| test_tune.py::test_checkpoint_ddp PASSED                                 [ 42%]
| test_tune.py::test_checkpoint_horovod PASSED                             [ 57%]
| test_tune.py::test_checkpoint_ddp_gpu SKIPPED (test requires multi-G...) [ 71%]
| test_tune.py::test_checkpoint_horovod_gpu SKIPPED (test requires mul...) [ 85%]
| test_tune.py::test_tune_iteration_ddp_gpu SKIPPED (test requires mul...) [100%]
|
| =============================== warnings summary ===============================
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     LooseVersion(torch.__version__) <= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     _SYNC_BN_V3 = LooseVersion(torch.__version__) >= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     _SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     tensorboard.__version__
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
|     ) < LooseVersion("1.15"):
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/util/placement_group.py:80: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     ).remote(self)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/_private/ray_option_utils.py:291: DeprecationWarning: Setting 'object_store_memory' for actors is deprecated since it doesn't actually reserve the required object store memory. Use object spilling that's enabled by default (https://docs.ray.io/en/master/ray-core/objects/object-spilling.html) instead to bypass the object store memory size limitation.
|     stacklevel=1,
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_bundle_index parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|   /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_capture_child_tasks parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
|     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| ============================== slowest durations ===============================
| 12.07s call     ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 11.92s call     ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 9.61s call     ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 7.93s call     ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 3.66s setup    ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 3.55s setup    ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 3.01s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 2.97s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.77s teardown ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 2.64s teardown ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 2.56s setup    ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.46s setup    ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|
| (6 durations < 0.005s hidden.  Use -vv to show these durations.)
| ============= 4 passed, 3 skipped, 32 warnings in 68.14s (0:01:08) =============
[pytest on push/test_linux_ray_master_3]   ✅  Success - Test with Pytest
marcosrdac commented 2 years ago

I just tested this PR and it worked fine on my cluster (training on 12 GPUs).

JiahaoYao commented 2 years ago

Any ideas on how to fix this in the CI test?

Requested labels: ubuntu-latest
Job defined at: ray-project/ray_lightning/.github/workflows/test.yaml@refs/pull/196/merge
Waiting for a runner to pick up this job...

Could this be related to the typo mentioned here (https://github.com/orgs/community/discussions/31587)?

sxjscience commented 2 years ago

The CI error does not seem to be related to the PR:

        if not _JSONARGPARSE_SIGNATURES_AVAILABLE:
            raise ModuleNotFoundError(
>               f"{_JSONARGPARSE_SIGNATURES_AVAILABLE}. Try `pip install -U 'jsonargparse[signatures]'`."
            )
E           ModuleNotFoundError: Requirement 'jsonargparse[signatures]>=4.12.0' not met, DistributionNotFound: The 'docstring-parser>=0.15; extra == "signatures"' distribution was not found and is required by jsonargparse. Try `pip install -U 'jsonargparse[signatures]'`.

/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/pytorch_lightning/cli.py:73: ModuleNotFoundError
=============================== warnings summary ===============================
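
The missing piece is docstring-parser, which the signatures extra of jsonargparse requires; one hedged way to keep environments without that extra from failing at import time is to skip the CLI-dependent tests instead. The module names below are real, but wiring them into the test suite this way is only a suggestion, not what this PR does.

import pytest

# Skip CLI-related tests when the jsonargparse "signatures" extra is not
# fully installed, instead of failing on import.
pytest.importorskip("jsonargparse")
pytest.importorskip("docstring_parser")  # pulled in by jsonargparse[signatures]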
sxjscience commented 2 years ago

@JiahaoYao Any plan to finish this PR?

JiahaoYao commented 2 years ago

The hanging issue still remains for the release test; a sketch of the Tune wiring follows the status output below.

== Status ==
Current time: 2022-10-03 17:49:05 (running for 00:14:54.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name              | status   | loc             |   batch_size |   layer_1 |   layer_2 |         lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING  | 10.1.0.228:5153 |           32 |        64 |        64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
== Status ==
Current time: 2022-10-03 17:49:10 (running for 00:14:59.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name              | status   | loc             |   batch_size |   layer_1 |   layer_2 |         lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING  | 10.1.0.228:5153 |           32 |        64 |        64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
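
To make the hang easier to reason about: the status output shows the single trial holding 2.0/2 CPUs, so one suspect is the resource request handed to Tune versus what the Ray workers inside the trial need. The sketch below shows the relevant wiring with a toy model; the model, hyperparameters, and metric names are placeholders, and the exact strategy/plugin class depends on the PL version this PR targets, so treat it as an illustration rather than the actual release-test code.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources


class TinyModel(pl.LightningModule):
    # Minimal stand-in for the release test's MNIST model.
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=32)


def train_mnist(config):
    trainer = pl.Trainer(
        max_epochs=1,
        strategy=RayStrategy(num_workers=2, use_gpu=False),
        callbacks=[TuneReportCallback({"loss": "loss"}, on="train_end")],
    )
    trainer.fit(TinyModel(config["lr"]))


# get_tune_resources() has to reserve CPUs for the Ray workers spawned inside
# each trial; if the trial itself holds every CPU (2.0/2 above) and the inner
# workers cannot be scheduled, the trial can sit in RUNNING indefinitely.
analysis = tune.run(
    train_mnist,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=1,
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=False),
)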
aga-relation commented 1 year ago

Any updates on this? :)