ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] `tune.run()` hangs on local cluster when using a functional trainable with `reuse_actors=True` #18808

Closed: jmakov closed this issue 3 years ago

jmakov commented 3 years ago

Ray Component

Ray Tune

What happened + What you expected to happen

Update: After investigating, it appears that reuse_actors=True is the culprit, causing the cluster to hang with unfulfilled resource requirements. Setting it to False resolves the issue. It looks like an interoperability problem between modin and Ray.
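
A minimal, self-contained sketch of the workaround (the trainable below is a dummy stand-in, not my real objective):

# Minimal sketch of the workaround: keep actor reuse disabled.
# `dummy_objective` is a placeholder, not the real trainable.
from ray import tune

def dummy_objective(config):
    tune.report(score=config["width"] * 2)

tune.run(
    dummy_objective,
    config={"width": tune.uniform(0, 20)},
    num_samples=4,
    reuse_actors=False,  # True is what triggers the hang for me
)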

tune.run() starts work on the local cluster, but after a couple of minutes fewer and fewer CPUs are used. Even once no CPU is utilized, tune.run() still hasn't finished. The expected behavior is that all cluster resources stay utilized until tune.run() finishes.

Additional info: ray monitor cluster.yaml shows that all CPUs are in use.

The same behavior occurs with/without:

"""
TL;DR: with or without the options below, Tune hangs.
If this line is enabled without resources_per_trial, we see the same behavior as if it were
commented out. The behavior is also the same if neither option is used, i.e. this line is
commented out and resources_per_trial is not passed to tune.run().
"""
os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"

tune.run(..., resources_per_trial={"cpu": 1, "gpu": 0})

tune.run(...) output:

# a table with status `PENDING` then this:
2021-09-22 12:28:01,957 WARNING worker.py:1215 -- The actor or task with ID 92829bc7597dbf8125fc8cfe9e99e6f86345a8ee426e5853 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_a4f7e07fa3d7f5cbb46a12669fd39d73: 1.000000}
Available resources on this node: {0.000000/8.000000 CPU, 10.253578 GiB/10.253578 GiB memory, 1.000000/1.000000 GPU, 4.394390 GiB/4.394390 GiB object_store_memory, 0.000000/1.000000 CPU_group_e1ef645d445c2e0e92a983a70f11000f, 1000.000000/1000.000000 bundle_group_0_29fa53e9ea89dd502cb92f1efd4b3771, 0.000000/1.000000 CPU_group_0_29fa53e9ea89dd502cb92f1efd4b3771, 1000.000000/1000.000000 bundle_group_29fa53e9ea89dd502cb92f1efd4b3771, 1000.000000/1000.000000 bundle_group_1b660658ad0be64124267792baef94df, 0.000000/1.000000 CPU_group_1b660658ad0be64124267792baef94df, 1000.000000/1000.000000 bundle_group_0_fd5f02ce364d16bd7e212f93ea4f9058, 1000.000000/1000.000000 bundle_group_3f68eb1d6967cac7bd655b310a561c18, 1000.000000/1000.000000 bundle_group_b1ab33d8146613f99b7ec86c328f670d, 0.000000/1.000000 CPU_group_b1ab33d8146613f99b7ec86c328f670d, 1000.000000/1000.000000 bundle_group_0_b1ab33d8146613f99b7ec86c328f670d, 1000.000000/1000.000000 bundle_group_0_1b660658ad0be64124267792baef94df, 1.000000/1.000000 node:192.168.0.102, 1000.000000/1000.000000 bundle_group_0_e1ef645d445c2e0e92a983a70f11000f, 0.000000/1.000000 CPU_group_29fa53e9ea89dd502cb92f1efd4b3771, 1000.000000/1000.000000 bundle_group_0_3f68eb1d6967cac7bd655b310a561c18, 0.000000/1.000000 CPU_group_0_fd5f02ce364d16bd7e212f93ea4f9058, 0.000000/1.000000 CPU_group_0_e1ef645d445c2e0e92a983a70f11000f, 0.000000/1.000000 CPU_group_fd5f02ce364d16bd7e212f93ea4f9058, 0.000000/1.000000 CPU_group_0_3f68eb1d6967cac7bd655b310a561c18, 1000.000000/1000.000000 bundle_group_fd5f02ce364d16bd7e212f93ea4f9058, 1000.000000/1000.000000 bundle_group_0_34f50cf62b1304bd01fde72b0d08f335, 1000.000000/1000.000000 bundle_group_34f50cf62b1304bd01fde72b0d08f335, 0.000000/1.000000 CPU_group_34f50cf62b1304bd01fde72b0d08f335, 1.000000/1.000000 accelerator_type:GT, 1000.000000/1000.000000 bundle_group_0_a4f7e07fa3d7f5cbb46a12669fd39d73, 0.000000/1.000000 CPU_group_a4f7e07fa3d7f5cbb46a12669fd39d73, 0.000000/1.000000 CPU_group_0_34f50cf62b1304bd01fde72b0d08f335, 0.000000/1.000000 CPU_group_0_b1ab33d8146613f99b7ec86c328f670d, 0.000000/1.000000 CPU_group_3f68eb1d6967cac7bd655b310a561c18, 1000.000000/1000.000000 bundle_group_e1ef645d445c2e0e92a983a70f11000f, 0.000000/1.000000 CPU_group_0_1b660658ad0be64124267792baef94df, 1000.000000/1000.000000 bundle_group_a4f7e07fa3d7f5cbb46a12669fd39d73, 0.000000/1.000000 CPU_group_0_a4f7e07fa3d7f5cbb46a12669fd39d73}
In total there are 3 pending tasks and 0 pending actors on this node.

# another table with status `RUNNING` followed by
2021-09-22 12:28:04,725 WARNING worker.py:1215 -- The actor or task with ID abb0af8900b301e5b0596bee642541e31a2e091eca36d06e cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_5050f97cfd67f334db742b630c885293: 1.000000}
Available resources on this node: {0.000000/32.000000 CPU, 77.083662 GiB/77.083662 GiB memory, 37.027238 GiB/37.027238 GiB object_store_memory, 1000.000000/1000.000000 bundle_group_0_193abf37542a2313833731b4eca71731, 0.000000/1.000000 CPU_group_0_a7d786474ba3a15ae1e71513111549f6, 0.000000/1.000000 CPU_group_1c01220eabdedeb653cc26428c1f942c, 0.000000/1.000000 CPU_group_0_6b920a2e2161d744da9d61933d7bc185, 1000.000000/1000.000000 bundle_group_0_1ef6f95503b8a5ca9d89eb042587c8cd, 1000.000000/1000.000000 bundle_group_0_8347f822527c2fab71faa867917997bc, 0.000000/1.000000 CPU_group_0_6515d3ebe6ec588fa543608d04c257c3, 1000.000000/1000.000000 bundle_group_0_0fb868db197312edb8dc9463d2f3d040, 0.000000/1.000000 CPU_group_aaa5ad16ca69384c44b791eec4fb411d, 0.000000/1.000000 CPU_group_80a316d9455db4e8218f62f38311c844, 0.000000/1.000000 CPU_group_6515d3ebe6ec588fa543608d04c257c3, 0.000000/1.000000 CPU_group_8713c70ec590e6b62a6caf243fac8bf8, 1000.000000/1000.000000 bundle_group_0_77b8626d317f86c986611d59662116ff, 1000.000000/1000.000000 bundle_group_947464e62a5b00db4deb3362c7ad9495, 1000.000000/1000.000000 bundle_group_24ed97ee8dfb407354c89b68c78656e6, 0.000000/1.000000 CPU_group_0_77b8626d317f86c986611d59662116ff, 0.000000/1.000000 CPU_group_e05c71ab74f6385222e79ba572affc7b, 0.000000/1.000000 CPU_group_193abf37542a2313833731b4eca71731, 0.000000/1.000000 CPU_group_77b8626d317f86c986611d59662116ff, 1000.000000/1000.000000 bundle_group_8e91100727acacbec6ca1dc97e8b31ef, 0.000000/1.000000 CPU_group_56b1bd13fcdbb470e814dfa98fafa33c, 1000.000000/1000.000000 bundle_group_5050f97cfd67f334db742b630c885293, 0.000000/1.000000 CPU_group_0_193abf37542a2313833731b4eca71731, 1000.000000/1000.000000 bundle_group_56b1bd13fcdbb470e814dfa98fafa33c, 1000.000000/1000.000000 bundle_group_65ba6b7d2040c6c5369327a2f8609d75, 0.000000/1.000000 CPU_group_8bdc69e057b284f07da0c3e78d062f32, 1000.000000/1000.000000 bundle_group_556bd3553dfacc60fbc15e8372906955, 1000.000000/1000.000000 bundle_group_e05c71ab74f6385222e79ba572affc7b, 1000.000000/1000.000000 bundle_group_193abf37542a2313833731b4eca71731, 1000.000000/1000.000000 bundle_group_6b920a2e2161d744da9d61933d7bc185, 1000.000000/1000.000000 bundle_group_77b8626d317f86c986611d59662116ff, 1000.000000/1000.000000 bundle_group_8347f822527c2fab71faa867917997bc, 1000.000000/1000.000000 bundle_group_0_556bd3553dfacc60fbc15e8372906955, 1000.000000/1000.000000 bundle_group_0_aaa5ad16ca69384c44b791eec4fb411d, 0.000000/1.000000 CPU_group_6b920a2e2161d744da9d61933d7bc185, 0.000000/1.000000 CPU_group_a556d82aa28ecbd56fec45940d6070c3, 1000.000000/1000.000000 bundle_group_0_24ed97ee8dfb407354c89b68c78656e6, 1000.000000/1000.000000 bundle_group_0_c901fbcfda2160c66ce29192dd6fc859, 1000.000000/1000.000000 bundle_group_b0d5419682bc8250d68876e107714d3a, 0.000000/1.000000 CPU_group_947464e62a5b00db4deb3362c7ad9495, 0.000000/1.000000 CPU_group_0_1c01220eabdedeb653cc26428c1f942c, 1000.000000/1000.000000 bundle_group_2121396ce01654bdab0b47a52f7c416d, 1000.000000/1000.000000 bundle_group_8713c70ec590e6b62a6caf243fac8bf8, 1.000000/1.000000 node:192.168.0.101, 1000.000000/1000.000000 bundle_group_1c01220eabdedeb653cc26428c1f942c, 0.000000/1.000000 CPU_group_2121396ce01654bdab0b47a52f7c416d, 1000.000000/1000.000000 bundle_group_6515d3ebe6ec588fa543608d04c257c3, 1000.000000/1000.000000 bundle_group_46d75e4dd5b9e62e63fd7069f95c3a6b, 0.000000/1.000000 CPU_group_dde009bc84ce620eff4c4bce8b87ee93, 0.000000/1.000000 CPU_group_0_8347f822527c2fab71faa867917997bc, 1000.000000/1000.000000 
bundle_group_0_8bdc69e057b284f07da0c3e78d062f32, 0.000000/1.000000 CPU_group_65ba6b7d2040c6c5369327a2f8609d75, 0.000000/1.000000 CPU_group_0_c901fbcfda2160c66ce29192dd6fc859, 0.000000/1.000000 CPU_group_0_0fb868db197312edb8dc9463d2f3d040, 1000.000000/1000.000000 bundle_group_0_b7c2d2b2f98444fd7dcc55ea6e4e501b, 1000.000000/1000.000000 bundle_group_0_6b920a2e2161d744da9d61933d7bc185, 1000.000000/1000.000000 bundle_group_80a316d9455db4e8218f62f38311c844, 0.000000/1.000000 CPU_group_0_a556d82aa28ecbd56fec45940d6070c3, 0.000000/1.000000 CPU_group_0_46d75e4dd5b9e62e63fd7069f95c3a6b, 0.000000/1.000000 CPU_group_46d75e4dd5b9e62e63fd7069f95c3a6b, 0.000000/1.000000 CPU_group_0_65ba6b7d2040c6c5369327a2f8609d75, 0.000000/1.000000 CPU_group_c901fbcfda2160c66ce29192dd6fc859, 1000.000000/1000.000000 bundle_group_0_8e91100727acacbec6ca1dc97e8b31ef, 0.000000/1.000000 CPU_group_0_c91005e703ec8eb2e81bde11f409625c, 0.000000/1.000000 CPU_group_c91005e703ec8eb2e81bde11f409625c, 1000.000000/1000.000000 bundle_group_0_b0d5419682bc8250d68876e107714d3a, 0.000000/1.000000 CPU_group_0_b0d5419682bc8250d68876e107714d3a, 1000.000000/1000.000000 bundle_group_c901fbcfda2160c66ce29192dd6fc859, 1000.000000/1000.000000 bundle_group_0_6515d3ebe6ec588fa543608d04c257c3, 0.000000/1.000000 CPU_group_0_304289cb798dced10e785ec7df7cb90d, 1000.000000/1000.000000 bundle_group_0_a556d82aa28ecbd56fec45940d6070c3, 0.000000/1.000000 CPU_group_0_aaa5ad16ca69384c44b791eec4fb411d, 0.000000/1.000000 CPU_group_304289cb798dced10e785ec7df7cb90d, 0.000000/1.000000 CPU_group_0_8e91100727acacbec6ca1dc97e8b31ef, 1000.000000/1000.000000 bundle_group_0_c91005e703ec8eb2e81bde11f409625c, 0.000000/1.000000 CPU_group_0_56b1bd13fcdbb470e814dfa98fafa33c, 1000.000000/1000.000000 bundle_group_0_5050f97cfd67f334db742b630c885293, 0.000000/1.000000 CPU_group_0fb868db197312edb8dc9463d2f3d040, 0.000000/1.000000 CPU_group_0_5050f97cfd67f334db742b630c885293, 1000.000000/1000.000000 bundle_group_0fb868db197312edb8dc9463d2f3d040, 1000.000000/1000.000000 bundle_group_b7c2d2b2f98444fd7dcc55ea6e4e501b, 1000.000000/1000.000000 bundle_group_0_947464e62a5b00db4deb3362c7ad9495, 0.000000/1.000000 CPU_group_0_556bd3553dfacc60fbc15e8372906955, 1000.000000/1000.000000 bundle_group_dde009bc84ce620eff4c4bce8b87ee93, 0.000000/1.000000 CPU_group_0_1ef6f95503b8a5ca9d89eb042587c8cd, 1000.000000/1000.000000 bundle_group_0_2121396ce01654bdab0b47a52f7c416d, 1000.000000/1000.000000 bundle_group_0_e05c71ab74f6385222e79ba572affc7b, 1000.000000/1000.000000 bundle_group_0_dde009bc84ce620eff4c4bce8b87ee93, 0.000000/1.000000 CPU_group_5050f97cfd67f334db742b630c885293, 0.000000/1.000000 CPU_group_a7d786474ba3a15ae1e71513111549f6, 0.000000/1.000000 CPU_group_b7c2d2b2f98444fd7dcc55ea6e4e501b, 0.000000/1.000000 CPU_group_0_e05c71ab74f6385222e79ba572affc7b, 1000.000000/1000.000000 bundle_group_a556d82aa28ecbd56fec45940d6070c3, 0.000000/1.000000 CPU_group_2fc0f6b828afd4f42e32785481df14f3, 1000.000000/1000.000000 bundle_group_304289cb798dced10e785ec7df7cb90d, 1000.000000/1000.000000 bundle_group_0_abddd0959b3b6dc318aaa1e2fedc89c6, 1000.000000/1000.000000 bundle_group_0_1c01220eabdedeb653cc26428c1f942c, 0.000000/1.000000 CPU_group_0_947464e62a5b00db4deb3362c7ad9495, 0.000000/1.000000 CPU_group_0_b7c2d2b2f98444fd7dcc55ea6e4e501b, 1000.000000/1000.000000 bundle_group_0_8713c70ec590e6b62a6caf243fac8bf8, 0.000000/1.000000 CPU_group_1ef6f95503b8a5ca9d89eb042587c8cd, 0.000000/1.000000 CPU_group_8347f822527c2fab71faa867917997bc, 1000.000000/1000.000000 bundle_group_0_80a316d9455db4e8218f62f38311c844, 
0.000000/1.000000 CPU_group_0_abddd0959b3b6dc318aaa1e2fedc89c6, 0.000000/1.000000 CPU_group_0_2121396ce01654bdab0b47a52f7c416d, 1000.000000/1000.000000 bundle_group_8bdc69e057b284f07da0c3e78d062f32, 1000.000000/1000.000000 bundle_group_0_65ba6b7d2040c6c5369327a2f8609d75, 1000.000000/1000.000000 bundle_group_1ef6f95503b8a5ca9d89eb042587c8cd, 1000.000000/1000.000000 bundle_group_a7d786474ba3a15ae1e71513111549f6, 1000.000000/1000.000000 bundle_group_0_304289cb798dced10e785ec7df7cb90d, 1000.000000/1000.000000 bundle_group_0_56b1bd13fcdbb470e814dfa98fafa33c, 0.000000/1.000000 CPU_group_0_80a316d9455db4e8218f62f38311c844, 1000.000000/1000.000000 bundle_group_0_2fc0f6b828afd4f42e32785481df14f3, 0.000000/1.000000 CPU_group_556bd3553dfacc60fbc15e8372906955, 0.000000/1.000000 CPU_group_0_8713c70ec590e6b62a6caf243fac8bf8, 0.000000/1.000000 CPU_group_0_2fc0f6b828afd4f42e32785481df14f3, 0.000000/1.000000 CPU_group_b0d5419682bc8250d68876e107714d3a, 1000.000000/1000.000000 bundle_group_c91005e703ec8eb2e81bde11f409625c, 0.000000/1.000000 CPU_group_24ed97ee8dfb407354c89b68c78656e6, 0.000000/1.000000 CPU_group_8e91100727acacbec6ca1dc97e8b31ef, 0.000000/1.000000 CPU_group_0_24ed97ee8dfb407354c89b68c78656e6, 1000.000000/1000.000000 bundle_group_abddd0959b3b6dc318aaa1e2fedc89c6, 0.000000/1.000000 CPU_group_0_dde009bc84ce620eff4c4bce8b87ee93, 0.000000/1.000000 CPU_group_0_8bdc69e057b284f07da0c3e78d062f32, 1000.000000/1000.000000 bundle_group_0_46d75e4dd5b9e62e63fd7069f95c3a6b, 1000.000000/1000.000000 bundle_group_2fc0f6b828afd4f42e32785481df14f3, 0.000000/1.000000 CPU_group_abddd0959b3b6dc318aaa1e2fedc89c6, 1000.000000/1000.000000 bundle_group_aaa5ad16ca69384c44b791eec4fb411d, 1000.000000/1000.000000 bundle_group_0_a7d786474ba3a15ae1e71513111549f6}
In total there are 13 pending tasks and 0 pending actors on this node.
2021-09-22 12:28:09,270 WARNING worker.py:1215 -- The actor or task with ID d387096ea3ee0d5f520b4dcea46fe42080c1e9ba3cd4d216 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_2b4937db4304aec22e4525edabd0e4f9: 1.000000}
Available resources on this node: {0.000000/12.000000 CPU, 17.248753 GiB/17.248753 GiB memory, 1.000000/1.000000 GPU, 7.392323 GiB/7.392323 GiB object_store_memory, 1000.000000/1000.000000 bundle_group_0_58b8f4744880fb0c4edf7766c36edd3e, 1000.000000/1000.000000 bundle_group_674ce645bc677d793fe6977f0e105921, 1000.000000/1000.000000 bundle_group_0_d10a5149dba1511ebf09093ca4166430, 0.000000/1.000000 CPU_group_0_96020ed730e7287440800ce6120ce49f, 1000.000000/1000.000000 bundle_group_0_96020ed730e7287440800ce6120ce49f, 0.000000/1.000000 CPU_group_0_0c26bbfe98ba452eced78d504bdd1fdf, 0.000000/1.000000 CPU_group_0c26bbfe98ba452eced78d504bdd1fdf, 1000.000000/1000.000000 bundle_group_0_de8881532a9bb4eda69ac92145e3db87, 0.000000/1.000000 CPU_group_de8881532a9bb4eda69ac92145e3db87, 1000.000000/1000.000000 bundle_group_0_674ce645bc677d793fe6977f0e105921, 0.000000/1.000000 CPU_group_0_d10a5149dba1511ebf09093ca4166430, 1000.000000/1000.000000 bundle_group_0_0c26bbfe98ba452eced78d504bdd1fdf, 1000.000000/1000.000000 bundle_group_7dd84024aebe7b756681e03f65292291, 0.000000/1.000000 CPU_group_96020ed730e7287440800ce6120ce49f, 0.000000/1.000000 CPU_group_0_674ce645bc677d793fe6977f0e105921, 0.000000/1.000000 CPU_group_d10a5149dba1511ebf09093ca4166430, 0.000000/1.000000 CPU_group_d41544db1430ca75b80c3a1e7cb1b278, 0.000000/1.000000 CPU_group_0_58b8f4744880fb0c4edf7766c36edd3e, 1.000000/1.000000 node:192.168.0.100, 1000.000000/1000.000000 bundle_group_f23a94c2c0ecd35abc4eff553ba5fe66, 1000.000000/1000.000000 bundle_group_2b4937db4304aec22e4525edabd0e4f9, 0.000000/1.000000 CPU_group_0_2b4937db4304aec22e4525edabd0e4f9, 1000.000000/1000.000000 bundle_group_0_d0f3b8886e51211fd2adc1afca4d6a8c, 1000.000000/1000.000000 bundle_group_d0f3b8886e51211fd2adc1afca4d6a8c, 1000.000000/1000.000000 bundle_group_0_d41544db1430ca75b80c3a1e7cb1b278, 1000.000000/1000.000000 bundle_group_0_7dd84024aebe7b756681e03f65292291, 1000.000000/1000.000000 bundle_group_0_23fab51ae37bd70594c7975f5918e63c, 1000.000000/1000.000000 bundle_group_23fab51ae37bd70594c7975f5918e63c, 0.000000/1.000000 CPU_group_0_23fab51ae37bd70594c7975f5918e63c, 0.000000/1.000000 CPU_group_23fab51ae37bd70594c7975f5918e63c, 1000.000000/1000.000000 bundle_group_0_2b4937db4304aec22e4525edabd0e4f9, 1000.000000/1000.000000 bundle_group_58b8f4744880fb0c4edf7766c36edd3e, 0.000000/1.000000 CPU_group_58b8f4744880fb0c4edf7766c36edd3e, 0.000000/1.000000 CPU_group_0_d0f3b8886e51211fd2adc1afca4d6a8c, 0.000000/1.000000 CPU_group_0_d41544db1430ca75b80c3a1e7cb1b278, 1.000000/1.000000 accelerator_type:G, 1000.000000/1000.000000 bundle_group_96020ed730e7287440800ce6120ce49f, 1000.000000/1000.000000 bundle_group_0c26bbfe98ba452eced78d504bdd1fdf, 1000.000000/1000.000000 bundle_group_0_f23a94c2c0ecd35abc4eff553ba5fe66, 0.000000/1.000000 CPU_group_0_7dd84024aebe7b756681e03f65292291, 0.000000/1.000000 CPU_group_0_de8881532a9bb4eda69ac92145e3db87, 0.000000/1.000000 CPU_group_674ce645bc677d793fe6977f0e105921, 0.000000/1.000000 CPU_group_2b4937db4304aec22e4525edabd0e4f9, 0.000000/1.000000 CPU_group_f23a94c2c0ecd35abc4eff553ba5fe66, 1000.000000/1000.000000 bundle_group_d10a5149dba1511ebf09093ca4166430, 0.000000/1.000000 CPU_group_d0f3b8886e51211fd2adc1afca4d6a8c, 0.000000/1.000000 CPU_group_0_f23a94c2c0ecd35abc4eff553ba5fe66, 0.000000/1.000000 CPU_group_7dd84024aebe7b756681e03f65292291, 1000.000000/1000.000000 bundle_group_de8881532a9bb4eda69ac92145e3db87, 1000.000000/1000.000000 bundle_group_d41544db1430ca75b80c3a1e7cb1b278}
In total there are 8 pending tasks and 0 pending actors on this node.

Reproduction script

import modin.pandas as pd 
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

def easy_objective(config, data):
    data_df = data[0]

    # Here be dragons: if any of the lines below is included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"]).test.sum()) 
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).sum()  

    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.index.values, df.bid.values, df.ask.values, df.decimals_price[0]]),
    name="test_study",
    time_budget_s=3600*24*3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
            "steps": 100,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            "activation": tune.grid_search(["relu", "tanh"])
        },
    metric="score", 
    mode="max",
# but works with this enabled
#    search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1),  #N.B. "-1", else hangs
)
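
Note: the df referenced above is my own modin DataFrame and isn't defined in the snippet. A self-contained variant with synthetic data (the "test" column, sizes, and the small num_samples are placeholders) would look roughly like this:

import numpy as np
import modin.pandas as pd
import ray
from ray import tune

ray.init(address='auto', _redis_password='xxx')

# Synthetic stand-in for my real DataFrame; only the "test" column is used here.
df = pd.DataFrame({"test": np.random.rand(1000)})

def easy_objective(config, data):
    values = data[0]
    # Same pattern as above: modin work inside the Tune trainable.
    score = int(pd.DataFrame({"test": values}).explode("test")["test"].sum())
    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.test.values]),
    name="test_study_minimal",
    num_samples=8,
    config={"width": tune.uniform(0, 20)},
    metric="score",
    mode="max",
)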


jmakov commented 3 years ago

With log_to_file=True, the trialdir/stdout and trialdir/stderr files are also not created.

jmakov commented 3 years ago

Here's a scenario where everything works except the ray dashboard:

jmakov commented 3 years ago

As discussed with @Yard1 on a call, when reuse_actors is commented out, it works as expected.

jmakov commented 3 years ago

After initial success (it ran great for about 15 minutes), fewer and fewer CPUs on the cluster are used until the cluster has nothing to do and Tune hangs. ray monitor shows that all of the cluster's CPUs are in use. After another run (after taking the cluster down, stopping it, and bringing it back up), it again starts processes on the whole cluster, but hangs not after 15 minutes but after a couple of seconds. The same behavior occurs if I leave out the search_alg argument.
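
For debugging, a quick way to snapshot what the cluster reports as free, independent of ray monitor, is to query the standard Ray resource APIs from a separate driver attached to the same cluster:

# Snapshot of cluster-level resource accounting while Tune appears stuck.
import ray

ray.init(address="auto", _redis_password="xxx")
print("total:    ", ray.cluster_resources())
print("available:", ray.available_resources())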

krfricke commented 3 years ago

Interesting, I'm wondering if this is actually due to the PB2 scheduler.

Can you share a full reproducible script (including a (fake) trainable, i.e. get_signal and OM_process_tune) and maybe the cluster config? Is this running on AWS?

Which Ray version are you running?

Yard1 commented 3 years ago

@krfricke We removed the scheduler and it didn't change anything. The Ray version is 1.6. @jmakov can share more information.

jmakov commented 3 years ago

@krfricke It's a local cluster started with ray up --no-config-cache cluster.yaml. I currently don't have access to AWS or Google Cloud. What currently works is using ConcurrencyLimiter and commenting out reuse_actors=True.

cluster.yaml:

cluster_name: default
provider:
  type: local
  head_ip: 192.168.0.101
  worker_ips:
    - 192.168.0.100
    - 192.168.0.102
auth:
  ssh_user: toaster
min_workers: 2
max_workers: 2
upscaling_speed: 1.0
idle_timeout_minutes: 5
file_mounts: {
      "~/workspace_ray_cluster": "~/workspace/puma/src/puma_lab",
}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
  - "**/.git"
  - "**/.git/**"
rsync_filter:
  - ".gitignore"
initialization_commands: []
setup_commands:
  - conda env create -q -n puma-lab -f ~/workspace_ray_cluster/environment.yaml || conda env update -q -n puma-lab -f ~/workspace_ray_cluster/environment.yaml
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
  - conda activate puma-lab && ray stop
  - conda activate puma-lab && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - conda activate puma-lab && ray stop
  - conda activate puma-lab && ray start --address=$RAY_HEAD_IP:6379

environment.yaml:

name: puma-lab
channels:
  - pyviz
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - abseil-cpp=20210324.2=h9c3ff4c_0
  - alembic=1.7.3=pyhd8ed1ab_0
  - alsa-lib=1.2.3=h516909a_0
  - anyio=3.3.0=py37h89c1867_0
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37h5e8e339_2
  - arrow-cpp=5.0.0=py37hdf48254_5_cpu
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - autopage=0.4.0=pyhd8ed1ab_0
  - aws-c-cal=0.5.11=h95a6274_0
  - aws-c-common=0.6.2=h7f98852_0
  - aws-c-event-stream=0.2.7=h3541f99_13
  - aws-c-io=0.10.5=hfb6a706_0
  - aws-checksums=0.1.11=ha31a3da_7
  - aws-sdk-cpp=1.8.186=hb4091e7_3
  - babel=2.9.1=pyh44b312d_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - backports.zoneinfo=0.2.1=py37h5e8e339_4
  - bleach=4.1.0=pyhd8ed1ab_0
  - bokeh=2.3.3=py37h89c1867_0
  - brotlipy=0.7.0=py37h5e8e339_1001
  - bzip2=1.0.8=h7f98852_4
  - c-ares=1.17.2=h7f98852_0
  - ca-certificates=2021.5.30=ha878542_0
  - certifi=2021.5.30=py37h89c1867_0
  - cffi=1.14.6=py37hc58025e_0
  - chardet=4.0.0=py37h89c1867_1
  - charset-normalizer=2.0.0=pyhd8ed1ab_0
  - click=8.0.1=py37h89c1867_0
  - clickhouse-cityhash=1.0.2.3=py37h3340039_2
  - clickhouse-driver=0.2.1=py37h5e8e339_0
  - cliff=3.9.0=pyhd8ed1ab_0
  - cloudpickle=2.0.0=pyhd8ed1ab_0
  - cmaes=0.8.2=pyh44b312d_0
  - cmd2=2.2.0=py37h89c1867_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - colorcet=2.0.6=pyhd8ed1ab_0
  - colorlog=6.4.1=py37h89c1867_0
  - conda=4.10.3=py37h89c1867_1
  - conda-package-handling=1.7.3=py37h5e8e339_0
  - cramjam=2.3.1=py37h5e8e339_1
  - cryptography=3.4.7=py37h5d9358c_0
  - cycler=0.10.0=py_2
  - cytoolz=0.11.0=py37h5e8e339_3
  - dask=2021.9.0=pyhd8ed1ab_0
  - dask-core=2021.9.0=pyhd8ed1ab_0
  - datashader=0.13.0=pyh6c4a22f_0
  - datashape=0.5.4=py_1
  - dbus=1.13.6=h48d8840_2
  - debugpy=1.4.1=py37hcd2ae1e_0
  - decorator=5.1.0=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - distributed=2021.9.0=py37h89c1867_0
  - entrypoints=0.3=py37hc8dfbb8_1002
  - expat=2.4.1=h9c3ff4c_0
  - fastparquet=0.7.1=py37hb1e94ed_0
  - filelock=3.0.12=pyh9f0ad1d_0
  - fontconfig=2.13.1=hba837de_1005
  - freetype=2.10.4=h0708190_1
  - fsspec=2021.8.1=pyhd8ed1ab_0
  - gettext=0.19.8.1=h0b5b191_1005
  - gflags=2.2.2=he1b5a44_1004
  - gitdb=4.0.7=pyhd8ed1ab_0
  - gitpython=3.1.23=pyhd8ed1ab_1
  - glib=2.68.4=h9c3ff4c_0
  - glib-tools=2.68.4=h9c3ff4c_0
  - glog=0.5.0=h48cff8f_0
  - greenlet=1.1.1=py37hcd2ae1e_0
  - grpc-cpp=1.40.0=h850795e_0
  - gst-plugins-base=1.18.5=hf529b03_0
  - gstreamer=1.18.5=h76c114f_0
  - heapdict=1.0.1=py_0
  - holoviews=1.14.5=py_0
  - hvplot=0.7.3=py_0
  - icu=68.1=h58526e2_0
  - idna=3.1=pyhd3deb0d_0
  - importlib-metadata=4.8.1=py37h89c1867_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - importlib_resources=5.2.2=pyhd8ed1ab_0
  - ipykernel=6.4.1=py37h6531663_0
  - ipympl=0.7.0=pyhd8ed1ab_0
  - ipython=7.27.0=py37h6531663_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.6.5=pyhd8ed1ab_0
  - jbig=2.1=h7f98852_2003
  - jedi=0.18.0=py37h89c1867_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - joblib=1.0.1=pyhd8ed1ab_0
  - jpeg=9d=h36c2ea0_0
  - json5=0.9.5=pyh9f0ad1d_0
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter-server-mathjax=0.2.3=pyhd8ed1ab_0
  - jupyter_client=7.0.2=pyhd8ed1ab_0
  - jupyter_contrib_core=0.3.3=py_2
  - jupyter_contrib_nbextensions=0.5.1=py37hc8dfbb8_1
  - jupyter_core=4.7.1=py37h89c1867_0
  - jupyter_highlight_selected_word=0.2.0=py37h89c1867_1002
  - jupyter_latex_envs=1.4.6=py37h89c1867_1001
  - jupyter_nbextensions_configurator=0.4.1=py37h89c1867_2
  - jupyter_server=1.11.0=pyhd8ed1ab_0
  - jupyterlab=3.1.11=pyhd8ed1ab_0
  - jupyterlab-git=0.32.2=pyhd8ed1ab_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - jupyterlab_server=2.8.1=pyhd8ed1ab_0
  - jupyterlab_widgets=1.0.2=pyhd8ed1ab_0
  - kiwisolver=1.3.2=py37h2527ec5_0
  - krb5=1.19.2=hcc1bbae_0
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=2.2.1=h9c3ff4c_0
  - libarchive=3.5.2=hccf745f_0
  - libblas=3.9.0=11_linux64_openblas
  - libbrotlicommon=1.0.9=h7f98852_5
  - libbrotlidec=1.0.9=h7f98852_5
  - libbrotlienc=1.0.9=h7f98852_5
  - libcblas=3.9.0=11_linux64_openblas
  - libclang=11.1.0=default_ha53f305_1
  - libcurl=7.78.0=h2574ce0_0
  - libdeflate=1.7=h7f98852_5
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=h516909a_1
  - libevent=2.1.10=hcdb4288_3
  - libffi=3.3=h58526e2_2
  - libgcc-ng=11.1.0=hc902ee8_8
  - libgfortran-ng=11.1.0=h69a702a_8
  - libgfortran5=11.1.0=h6c583b3_8
  - libglib=2.68.4=h3e27bee_0
  - libgomp=11.1.0=hc902ee8_8
  - libiconv=1.16=h516909a_0
  - liblapack=3.9.0=11_linux64_openblas
  - libllvm11=11.1.0=hf817b99_2
  - libnghttp2=1.43.0=h812cca2_0
  - libogg=1.3.4=h7f98852_1
  - libopenblas=0.3.17=pthreads_h8fe5266_1
  - libopus=1.3.1=h7f98852_1
  - libpng=1.6.37=h21135ba_2
  - libpq=13.3=hd57d9b9_0
  - libprotobuf=3.16.0=h780b84a_0
  - libsodium=1.0.18=h36c2ea0_1
  - libsolv=0.7.19=h780b84a_5
  - libssh2=1.10.0=ha56f1ee_0
  - libstdcxx-ng=11.1.0=h56837e0_8
  - libta-lib=0.4.0=h516909a_0
  - libthrift=0.14.2=he6d91bd_1
  - libtiff=4.3.0=hf544144_1
  - libutf8proc=2.6.1=h7f98852_0
  - libuuid=2.32.1=h7f98852_1000
  - libuv=1.42.0=h7f98852_0
  - libvorbis=1.3.7=h9c3ff4c_0
  - libwebp-base=1.2.1=h7f98852_0
  - libxcb=1.13=h7f98852_1003
  - libxkbcommon=1.0.3=he3ba5ed_0
  - libxml2=2.9.12=h72842e0_0
  - libxslt=1.1.33=h15afd5d_2
  - llvmlite=0.37.0=py37h9d7f4d0_0
  - locket=0.2.0=py_2
  - lxml=4.6.3=py37h77fd288_0
  - lz4-c=1.9.3=h9c3ff4c_1
  - lzo=2.10=h516909a_1000
  - mako=1.1.5=pyhd8ed1ab_0
  - mamba=0.15.3=py37h7f483ca_0
  - markdown=3.3.4=pyhd8ed1ab_0
  - markupsafe=2.0.1=py37h5e8e339_0
  - matplotlib=3.4.3=py37h89c1867_0
  - matplotlib-base=3.4.3=py37h1058ff1_0
  - matplotlib-inline=0.1.3=pyhd8ed1ab_0
  - mistune=0.8.4=py37h5e8e339_1004
  - modin-core=0.10.2=py37h89c1867_1
  - modin-ray=0.10.2=py37h89c1867_1
  - msgpack-python=1.0.2=py37h2527ec5_1
  - multipledispatch=0.6.0=py_0
  - mysql-common=8.0.25=ha770c72_2
  - mysql-libs=8.0.25=hfa10184_2
  - nb_conda_kernels=2.3.1=py37h89c1867_0
  - nbclassic=0.3.1=pyhd8ed1ab_1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h89c1867_0
  - nbdime=3.1.0=pyhd8ed1ab_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - ncurses=6.2=h58526e2_4
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - nodejs=16.6.1=h92b4a50_0
  - notebook=6.4.3=pyha770c72_0
  - nspr=4.30=h9c3ff4c_0
  - nss=3.69=hb5efdd6_0
  - numba=0.54.0=py37h2d894fd_0
  - numpy=1.20.3=py37h038b26d_1
  - olefile=0.46=pyh9f0ad1d_1
  - openjpeg=2.4.0=hb52868f_1
  - openssl=1.1.1l=h7f98852_0
  - optuna=2.9.1=pyhd8ed1ab_0
  - orc=1.6.10=h58a87f1_0
  - packaging=21.0=pyhd8ed1ab_0
  - pandas=1.3.2=py37he8f5f7f_0
  - pandoc=2.14.2=h7f98852_0
  - pandocfilters=1.4.2=py_1
  - panel=0.12.1=py_0
  - param=1.11.1=pyh6c4a22f_0
  - parquet-cpp=1.5.1=1
  - parso=0.8.2=pyhd8ed1ab_0
  - partd=1.2.0=pyhd8ed1ab_0
  - pbr=5.6.0=pyhd8ed1ab_0
  - pcre=8.45=h9c3ff4c_0
  - pexpect=4.8.0=py37hc8dfbb8_1
  - pickle5=0.0.11=py37h5e8e339_0
  - pickleshare=0.7.5=py37hc8dfbb8_1002
  - pillow=8.3.2=py37h0f21c89_0
  - pip=21.2.4=pyhd8ed1ab_0
  - prettytable=2.2.0=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - psutil=5.8.0=py37h5e8e339_1
  - pthread-stubs=0.4=h36c2ea0_1001
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pyarrow=5.0.0=py37h58331f5_5_cpu
  - pycosat=0.6.3=py37h5e8e339_1006
  - pycparser=2.20=pyh9f0ad1d_2
  - pyct=0.4.6=py_0
  - pyct-core=0.4.6=py_0
  - pygments=2.10.0=pyhd8ed1ab_0
  - pykalman=0.9.5=py_1
  - pyopenssl=20.0.1=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyperclip=1.8.2=pyhd8ed1ab_2
  - pyqt=5.12.3=py37h89c1867_7
  - pyqt-impl=5.12.3=py37he336c9b_7
  - pyqt5-sip=4.19.18=py37hcd2ae1e_7
  - pyqtchart=5.12=py37he336c9b_7
  - pyqtwebengine=5.12.1=py37he336c9b_7
  - pyrsistent=0.17.3=py37h5e8e339_2
  - pysocks=1.7.1=py37h89c1867_3
  - python=3.7.10=hffdb5ce_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pyviz_comms=2.1.0=py_0
  - pyyaml=5.4.1=py37h5e8e339_1
  - pyzmq=22.2.1=py37h336d617_0
  - qt=5.12.9=hda022c4_4
  - ray-core=1.6.0=py37hf931bba_0
  - ray-tune=1.6.0=py37h89c1867_0
  - re2=2021.09.01=h9c3ff4c_0
  - readline=8.1=h46c0cb4_0
  - redis-py=3.5.3=pyh9f0ad1d_0
  - reproc=14.2.3=h7f98852_0
  - reproc-cpp=14.2.3=h9c3ff4c_0
  - requests=2.26.0=pyhd8ed1ab_0
  - requests-unixsocket=0.2.0=py_0
  - ruamel_yaml=0.15.80=py37h5e8e339_1004
  - s2n=1.0.10=h9b69904_0
  - scikit-learn=0.24.2=py37hf0f1638_1
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setproctitle=1.1.10=py37h5e8e339_1004
  - setuptools=58.0.4=py37h89c1867_0
  - six=1.16.0=pyh6c4a22f_0
  - smmap=3.0.5=pyh44b312d_0
  - snappy=1.1.8=he1b5a44_3
  - sniffio=1.2.0=py37h89c1867_1
  - sortedcontainers=2.4.0=pyhd8ed1ab_0
  - sqlalchemy=1.4.25=py37h5e8e339_0
  - sqlite=3.36.0=h9cd32fc_1
  - stevedore=3.4.0=py37h89c1867_0
  - ta-lib=0.4.19=py37ha21ca33_2
  - tabulate=0.8.9=pyhd8ed1ab_0
  - tblib=1.7.0=pyhd8ed1ab_0
  - tensorboardx=2.4=pyhd8ed1ab_0
  - terminado=0.12.1=py37h89c1867_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - threadpoolctl=2.2.0=pyh8a188c0_0
  - thrift=0.13.0=py37hcd2ae1e_2
  - tk=8.6.11=h27826a3_1
  - toolz=0.11.1=py_0
  - tornado=6.1=py37h5e8e339_1
  - tqdm=4.62.2=pyhd8ed1ab_0
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - tzdata=2021a=he74cb21_1
  - tzlocal=3.0=py37h89c1867_2
  - urllib3=1.26.6=pyhd8ed1ab_0
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - websocket-client=0.57.0=py37h89c1867_4
  - wheel=0.37.0=pyhd8ed1ab_1
  - widgetsnbextension=3.5.1=py37h89c1867_4
  - xarray=0.19.0=pyhd8ed1ab_1
  - xeus=2.0.0=h7d0c39e_0
  - xeus-python=0.13.0=py37h4b46df4_1
  - xeus-python-shell=0.1.5=pyhd8ed1ab_0
  - xorg-libxau=1.0.9=h7f98852_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.5=h516909a_1
  - yaml=0.2.5=h516909a_0
  - zeromq=4.3.4=h9c3ff4c_1
  - zict=2.0.0=py_0
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h516909a_1010
  - zstandard=0.15.2=py37h5e8e339_0
  - zstd=1.5.0=ha95c52a_0
  - pip:
    - absl-py==0.13.0
    - aiohttp==3.7.4.post0
    - aiohttp-cors==0.7.0
    - aioredis==1.3.1
    - async-timeout==3.0.1
    - autograd==1.3
    - bayesian-optimization==1.2.0
    - blessings==1.7
    - cachetools==4.2.2
    - cma==2.7.0
    - colorful==0.5.4
    - cython==0.29.24
    - future==0.18.2
    - google-api-core==1.31.2
    - google-auth==1.35.0
    - google-auth-oauthlib==0.4.6
    - googleapis-common-protos==1.53.0
    - gpustat==0.6.0
    - gpy==1.10.0
    - gpytorch==1.5.1
    - grpcio==1.40.0
    - hebo==0.1.0
    - hiredis==2.0.0
    - multidict==5.1.0
    - nevergrad==0.4.3.post8
    - nvidia-ml-py3==7.352.0
    - oauthlib==3.1.1
    - opencensus==0.7.13
    - opencensus-context==0.1.2
    - paramz==0.9.5
    - protobuf==3.17.3
    - py-spy==0.3.9
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pymoo==0.4.2.2
    - ray==1.6.0
    - requests-oauthlib==1.3.0
    - rsa==4.7.2
    - scipy==1.5.4
    - sklearn==0.0
    - tensorboard==2.6.0
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.0
    - torch==1.9.1
    - werkzeug==2.0.1
    - yarl==1.6.3

jmakov commented 3 years ago

@krfricke @Yard1 I think I've found at least part of the problem. This example hangs with or without reuse_actors. It looks like a modin-ray interoperability issue:

import modin.pandas as pd 
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

def easy_objective(config, data):
    data_df = data[0]

    # Here be dragons: if any of the lines below is included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"]).test.sum()) 
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).sum()  

    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.index.values, df.bid.values, df.ask.values, df.decimals_price[0]]),
    name="test_study",
    time_budget_s=3600*24*3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
            "steps": 100,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            "activation": tune.grid_search(["relu", "tanh"])
        },
    metric="score", 
    mode="max",
# but works with this enabled
#    search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1),  #N.B. "-1", else hangs
)

I also vote we rename the project from ray to dragons_everywhere :P.

Yard1 commented 3 years ago

Yeah, it looks like Tune is taking up all CPU resources, which deadlocks the modin operations inside the trainable. This is also why limiting concurrency fixes the issue, as it frees up enough CPUs for modin to work.
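
For example, capping the number of concurrent trials below the cluster's CPU count leaves headroom for modin. A rough sketch (the objective and the exact max_concurrent value are placeholders):

from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

NUM_CLUSTER_CPUS = 52  # placeholder; e.g. int(ray.cluster_resources()["CPU"])

def objective(config):
    # Placeholder trainable; the real one would do modin work here.
    tune.report(score=config["width"])

tune.run(
    objective,
    config={"width": tune.uniform(0, 20)},
    num_samples=100,
    metric="score",
    mode="max",
    # Leave a couple of CPUs free for modin's own Ray tasks.
    search_alg=BasicVariantGenerator(max_concurrent=NUM_CLUSTER_CPUS - 2),
)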

jmakov commented 3 years ago

Opened an issue on the modin project: https://github.com/modin-project/modin/issues/3479. Not sure when they will respond, but if it takes a couple of days, perhaps we can just update the Ray docs in the meantime?

jmakov commented 3 years ago

After discussing the issue further with @Yard1, resources_per_trial={"cpu": 0, "extra_cpu": 1} also works. In that case, though, ray monitor reports only 3 to 5 CPUs in use across the whole cluster (52 available), even though almost all CPUs are actually busy. Another observation with the resources_per_trial workaround: node1 has 12 CPUs and a load of 20, node2 has 8 CPUs and a load of 20, and node3 (the head node) has 32 CPUs and a load of 18. Is the head node intentionally underutilized, or is Ray just distributing the workload equally among nodes?
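
For reference, that call looks roughly like this (the trainable and search space are trimmed-down placeholders):

from ray import tune

def objective(config):
    # Placeholder trainable standing in for the real modin-based objective.
    tune.report(score=config["width"])

tune.run(
    objective,
    config={"width": tune.uniform(0, 20)},
    num_samples=20,
    metric="score",
    mode="max",
    # The workaround discussed above: no CPU for the trainable itself, one "extra" CPU.
    resources_per_trial={"cpu": 0, "extra_cpu": 1},
)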

With ConcurrencyLimiter(..., max_concurrent=AVAIL_CPUS_ON_CLUSTER - 2) the workload is handled better: node1: 13, node2: 8, node3: 31.