ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

[ray] Fix dynamic resource allocation with RayDatasets #1441

Open tgaddair opened 2 years ago

tgaddair commented 2 years ago

It seems there is a deadlock that arises when using Ray Tune with dynamic resource allocation together with RayDatasets. When we set cache_format = 'parquet', everything works fine, but when we use the new default cache_format = 'ray', trials hang, presumably because the RayDatasets are locking out some of the resources needed by dynamic allocation.

Even if we bump up the number of resources in the cluster, we end up in the same place:

2021-10-29 12:55:53,822 WARNING worker.py:1227 -- The actor or task with ID eae16da7b2a06f615e017e6d60f6e1ebc6be08064045c345 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_098d048e4d6df88103b3ad1b5a1e6f44: 1.000000}
Available resources on this node: {5.998000/8.000000 CPU, 292485120.019531 GiB/292485120.019531 GiB memory, 7680000.000000 GiB/7680000.000000 GiB object_store_memory, 1.000000/1.000000 CPU_group_1_f26449396dc25d88645e711373b91bd4, 1000.000000/1000.000000 bundle_group_1_098d048e4d6df88103b3ad1b5a1e6f44, 0.000000/0.001000 CPU_group_0_098d048e4d6df88103b3ad1b5a1e6f44, 1.000000/1.000000 CPU_group_1_098d048e4d6df88103b3ad1b5a1e6f44, 2000.000000/2000.000000 bundle_group_098d048e4d6df88103b3ad1b5a1e6f44, 1000.000000/1000.000000 bundle_group_0_098d048e4d6df88103b3ad1b5a1e6f44, 1000.000000/1000.000000 bundle_group_0_f26449396dc25d88645e711373b91bd4, 0.000000/1.001000 CPU_group_098d048e4d6df88103b3ad1b5a1e6f44, 0.000000/0.001000 CPU_group_0_f26449396dc25d88645e711373b91bd4, 1.000000/1.000000 node:192.168.4.54, 2000.000000/2000.000000 bundle_group_f26449396dc25d88645e711373b91bd4, 0.000000/1.001000 CPU_group_f26449396dc25d88645e711373b91bd4, 1000.000000/1000.000000 bundle_group_1_f26449396dc25d88645e711373b91bd4}
In total there are 6 pending tasks and 0 pending actors on this node.
== Status ==
Memory usage on this node: 9.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.002/8 CPUs, 0/0 GPUs, 0.0/5.58 GiB heap, 0.0/0.15 GiB objects
Result logdir: /tmp/mock-client-8fb5/trainable_func_fiYzhHE
Number of trials: 2/2 (2 RUNNING)
+-------------------+----------+-------+------------------------+------------------------------+--------------------------+
| Trial name        | status   | loc   |   binary_46001.fc_size |   binary_46001.num_fc_layers |   training.learning_rate |
|-------------------+----------+-------+------------------------+------------------------------+--------------------------|
| trial_23bf9_00000 | RUNNING  |       |                    124 |                            4 |               0.00561152 |
| trial_23bf9_00001 | RUNNING  |       |                    220 |                            2 |               0.0291064  |
+-------------------+----------+-------+------------------------+------------------------------+--------------------------+
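In the meantime, a minimal sketch of the cache_format = 'parquet' workaround mentioned above, assuming the cache format can be set in the Ray backend section of the Ludwig config (the exact key location may vary across Ludwig versions, and the feature definitions here are placeholders):

from ludwig.api import LudwigModel

# Placeholder features; only the backend section matters for the workaround.
config = {
    "input_features": [{"name": "text_feature", "type": "text"}],
    "output_features": [{"name": "label", "type": "binary"}],
    "backend": {
        "type": "ray",
        # Pin the dataset cache format back to Parquet instead of the
        # new default 'ray' until this issue is resolved.
        "cache_format": "parquet",
    },
}

model = LudwigModel(config)
# model.train(dataset="train.csv")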
tgaddair commented 2 years ago

cc @clarkzinzow

Yard1 commented 2 years ago

Hey @tgaddair, a workaround would be to use the max_concurrent_trials argument in tune.run. It ensures that no more than N trials will be scheduled at a time. If you set it so that some resources are left unclaimed by trials, the Ray Datasets workers will be able to run. In the case of dynamic resource allocation, the function used to compute per-trial resources will need to be modified to keep some CPUs free (that should be trivial).
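For illustration, a minimal sketch of that workaround, assuming a Ray version in which tune.run accepts max_concurrent_trials; the trainable and the resource figures below are placeholders:

from ray import tune

# Placeholder trainable standing in for the Ludwig hyperopt trial function.
def run_trial(config):
    tune.report(loss=config["learning_rate"])

analysis = tune.run(
    run_trial,
    config={"learning_rate": tune.loguniform(1e-4, 1e-1)},
    num_samples=2,
    resources_per_trial={"cpu": 2},
    # Cap concurrency so some CPUs stay unclaimed by trials and remain
    # available for the Ray Datasets workers (e.g. 2 trials x 2 CPUs on an
    # 8-CPU node leaves 4 CPUs free).
    max_concurrent_trials=2,
)

With dynamic resource allocation the same idea applies: whatever function produces the per-trial resources should be capped so that the trials never claim every CPU in the cluster.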

We'll be looking into making a proper fix for this.