JiahaoYao opened 2 years ago
Checking the CI failure about:
E TypeError: __init__() got an unexpected keyword argument 'checkpoint_callback'
It might be related to these two lines:
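For context, newer PyTorch Lightning releases removed the `checkpoint_callback` argument from `Trainer` (deprecated in 1.5, removed in 1.7) in favor of `enable_checkpointing`. A hedged sketch of a version shim (the helper name and the exact version cutoff are assumptions, not code from this repo):

```python
def trainer_ckpt_kwargs(enable: bool = True, pl_version: str = "1.7.0") -> dict:
    """Return the right checkpointing kwarg for a given Lightning version.

    Hypothetical shim: checkpoint_callback was removed in PL 1.7 and replaced
    by enable_checkpointing (verify the cutoff against the pinned version).
    """
    major, minor = (int(part) for part in pl_version.split(".")[:2])
    if (major, minor) >= (1, 7):
        return {"enable_checkpointing": enable}
    return {"checkpoint_callback": enable}

# Usage sketch: pl.Trainer(**trainer_ckpt_kwargs(True, pl.__version__), ...)
```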
Interesting: the other tests are hanging here, but they pass on the servers.
It seems the ubuntu-latest runner does not have enough memory.
test_horovod does not start for test_tune.py
running ray_ddp_example.py with Tune
Does that mean there is an OOM issue when running with Tune?
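If the runner is genuinely memory-starved, one mitigation is to cap Ray's footprint when the tests start it. A minimal sketch (the helper name and its defaults are hypothetical; it assumes the `ray` package and degrades to a no-op without it):

```python
def ray_init_small(num_cpus: int = 2, store_mb: int = 256):
    """Start Ray with a small object store for memory-constrained CI runners.

    Hypothetical helper for illustration; returns None when ray is not
    installed, otherwise the ray module after a capped ray.init().
    """
    try:
        import ray
    except ImportError:
        return None
    if not ray.is_initialized():
        ray.init(num_cpus=num_cpus,
                 object_store_memory=store_mb * 1024 * 1024,
                 include_dashboard=False)
    return ray
```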
[pytest on push/test_linux_ray_master_3] ✅ Success - Install package
[pytest on push/test_linux_ray_master_3] ⭐ Run Test with Pytest
[pytest on push/test_linux_ray_master_3] 🐳 docker exec cmd=[bash --noprofile --norc -e -o pipefail /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/workflow/4] user=
| /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning/ray_lightning/tests /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
| ============================= test session starts ==============================
| platform linux -- Python 3.7.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/hostedtoolcache/Python/3.7.13/x64/bin/python
| cachedir: .pytest_cache
| rootdir: /Users/jimmy/ScratchGym/Scratch/test0818/ray_lightning
collected 7 items
|
| test_tune.py::test_tune_iteration_ddp PASSED [ 14%]
| test_tune.py::test_tune_iteration_horovod PASSED [ 28%]
| test_tune.py::test_checkpoint_ddp PASSED [ 42%]
| test_tune.py::test_checkpoint_horovod PASSED [ 57%]
| test_tune.py::test_checkpoint_ddp_gpu SKIPPED (test requires multi-G...) [ 71%]
| test_tune.py::test_checkpoint_horovod_gpu SKIPPED (test requires mul...) [ 85%]
| test_tune.py::test_tune_iteration_ddp_gpu SKIPPED (test requires mul...) [100%]
|
| =============================== warnings summary ===============================
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:34: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| LooseVersion(torch.__version__) <= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:36: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| _SYNC_BN_V3 = LooseVersion(torch.__version__) >= LooseVersion('1.6.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/horovod/torch/sync_batch_norm.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| _SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| tensorboard.__version__
|
| ../../../../../../../../opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
| ) < LooseVersion("1.15"):
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/util/placement_group.py:80: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| ).remote(self)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/_private/ray_option_utils.py:291: DeprecationWarning: Setting 'object_store_memory' for actors is deprecated since it doesn't actually reserve the required object store memory. Use object spilling that's enabled by default (https://docs.ray.io/en/master/ray-core/objects/object-spilling.html) instead to bypass the object store memory size limitation.
| stacklevel=1,
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_bundle_index parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ray/actor.py:637: DeprecationWarning: placement_group_capture_child_tasks parameter is deprecated. Use scheduling_strategy=PlacementGroupSchedulingStrategy(...) instead, see the usage at https://docs.ray.io/en/master/ray-core/package-ref.html#ray-remote.
| return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
|
| -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| ============================== slowest durations ===============================
| 12.07s call ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 11.92s call ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 9.61s call ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 7.93s call ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 3.66s setup ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 3.55s setup ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 3.01s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_ddp
| 2.97s teardown ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.77s teardown ray_lightning/tests/test_tune.py::test_checkpoint_ddp
| 2.64s teardown ray_lightning/tests/test_tune.py::test_checkpoint_horovod
| 2.56s setup ray_lightning/tests/test_tune.py::test_tune_iteration_horovod
| 2.46s setup ray_lightning/tests/test_tune.py::test_checkpoint_horovod
|
| (6 durations < 0.005s hidden. Use -vv to show these durations.)
| ============= 4 passed, 3 skipped, 32 warnings in 68.14s (0:01:08) =============
[pytest on push/test_linux_ray_master_3] ✅ Success - Test with Pytest
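As an aside, the placement_group DeprecationWarnings in the log above point at migrating to `scheduling_strategy=PlacementGroupSchedulingStrategy(...)`. A hedged sketch of what that migration could look like (assumes ray >= 1.13; this is illustrative, not the actual ray_lightning code):

```python
def pg_options(pg, bundle_index: int = 0):
    """Build Actor.options(**...) kwargs via the non-deprecated API.

    Returns None when ray is unavailable so the sketch stays importable.
    """
    try:
        from ray.util.scheduling_strategies import (
            PlacementGroupSchedulingStrategy)
    except ImportError:
        return None
    # Replaces the deprecated placement_group / placement_group_bundle_index /
    # placement_group_capture_child_tasks keyword arguments.
    return {
        "scheduling_strategy": PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=bundle_index,
            placement_group_capture_child_tasks=True,
        )
    }
```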
I just tested this PR and it worked fine on my cluster (training on 12 GPUs).
Any ideas on how to fix this in the CI test?
Requested labels: ubuntu-latest
Job defined at: ray-project/ray_lightning/.github/workflows/test.yaml@refs/pull/196/merge
Waiting for a runner to pick up this job...
Is this the typo mentioned here (https://github.com/orgs/community/discussions/31587)?
The CI error does not seem to be related to the PR:
if not _JSONARGPARSE_SIGNATURES_AVAILABLE:
raise ModuleNotFoundError(
> f"{_JSONARGPARSE_SIGNATURES_AVAILABLE}. Try `pip install -U 'jsonargparse[signatures]'`."
)
E ModuleNotFoundError: Requirement 'jsonargparse[signatures]>=4.12.0' not met, DistributionNotFound: The 'docstring-parser>=0.15; extra == "signatures"' distribution was not found and is required by jsonargparse. Try `pip install -U 'jsonargparse[signatures]'`.
/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/pytorch_lightning/cli.py:73: ModuleNotFoundError
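The traceback points at a missing extra dependency rather than the PR itself; the usual fix is to reinstall the extra with its pin made explicit (pins taken from the error message above):

```shell
# docstring-parser>=0.15 is the distribution the error reports as missing
pip install -U "docstring-parser>=0.15" "jsonargparse[signatures]>=4.12.0"
```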
=============================== warnings summary ===============================
@JiahaoYao Any plan to finish this PR?
The hanging issue still remains in the release test:
== Status ==
Current time: 2022-10-03 17:49:05 (running for 00:14:54.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name | status | loc | batch_size | layer_1 | layer_2 | lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING | 10.1.0.228:5153 | 32 | 64 | 64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
== Status ==
Current time: 2022-10-03 17:49:10 (running for 00:14:59.03)
Memory usage on this node: 1.7/6.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/3.56 GiB heap, 0.0/1.78 GiB objects
Result logdir: /home/runner/ray_results/tune_mnist
Number of trials: 1/1 (1 RUNNING)
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
| Trial name | status | loc | batch_size | layer_1 | layer_2 | lr |
|-------------------------+----------+-----------------+--------------+-----------+-----------+------------|
| train_mnist_97d14_00000 | RUNNING | 10.1.0.228:5153 | 32 | 64 | 64 | 0.00670904 |
+-------------------------+----------+-----------------+--------------+-----------+-----------+------------+
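One plausible cause of a trial sitting in RUNNING with 2.0/2 CPUs requested: the trial reserves every CPU on the node, leaving none for the Ray worker actors the strategy tries to launch, so training never starts. ray_lightning ships `get_tune_resources` to bundle worker resources into the trial request; a hedged sketch (keyword names should be checked against the installed version):

```python
def tune_trial_resources(num_workers: int = 2, use_gpu: bool = False):
    """Request Tune trial resources that also cover the Ray workers.

    Hedged sketch: returns None when ray_lightning is unavailable, and the
    exact keyword names should be verified against the installed version.
    """
    try:
        from ray_lightning.tune import get_tune_resources
    except ImportError:
        return None
    # Pass the result as resources_per_trial=... to tune.run(...)
    return get_tune_resources(num_workers=num_workers, use_gpu=use_gpu)
```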
Any updates on this? :)