@Alxe1 How long does it hang or do you wait till you killed the program? Did you get a chance to see the ray dashboard? Wonder if any work is actually being done. I notice that from the console, there is not even one training result reported back yet. So wonder if anything is wrong at the tensorflow layer.
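One quick way to check whether the cluster is doing any work, independent of the dashboard, is to compare total and currently free resources from a driver attached to the cluster. This is a general Ray diagnostic, not something taken from this thread:

import ray

ray.init(address="auto")  # attach to the already-running cluster

# If cluster_resources() and available_resources() differ for "CPU",
# some tasks or actors are currently holding CPUs.
print(ray.cluster_resources())
print(ray.available_resources())

The ray status CLI prints the same autoscaler summary that appears further down in this thread.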
Yes, it hangs for a long time. I used a small dataset to test it and it also hung for a long time and could not stop. But when I put the small dataset inside train_func like this:
import tensorflow as tf

# preprocessing_data and PARQUET_PATH are defined elsewhere in the full script.
def train_func(config):
    # -------------------------PUT DATASET HERE--------------------------
    config, dataset = preprocessing_data(PARQUET_PATH)
    config.update({"user_item_dim": 32, "feature_embed_dim": 16, "embed_norm": 0.001, "hidden_units": [64, 32, 32]})
    # ------------------------------------------------------------------------
    batch_size = config.get("batch_size", 1024)
    epochs = config.get("epochs", 3)
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
It works! It's so weird. But when I make the dataset bigger, it hangs for a long time again. The resources are:
======== Autoscaler status: 2022-08-26 12:04:02.317127 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_2fa56f09e075d36c8130b048f6e84530c73dba93419c43c9590ef108
1 node_9e5595a0042e88dc330c96f706c3ab99e52229a4a4cdbda44515369b
1 node_c8ba399fb7cf0e5c4fa17d5a9122b76e017e5e2ad13a4d3feed3c07d
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
9.0/16.0 CPU (9.0 used of 9.0 reserved in placement groups)
0.00/50.793 GiB memory
0.09/30.000 GiB object_store_memory
Demands:
(no resource demands)
And the dashboard:
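For context, the runs that hang presumably pass the Ray Dataset to the trainer instead of loading it inside train_func. A minimal sketch of that pattern on Ray 2.0 follows; preprocessing_data and PARQUET_PATH are the helpers from the snippet above, build_model is a hypothetical stand-in, and the ScalingConfig values are assumptions, not the actual configuration:

import tensorflow as tf
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_func(config):
    batch_size = config.get("batch_size", 1024)
    epochs = config.get("epochs", 3)
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    # Each worker reads only its shard of the dataset passed to the trainer below.
    shard = session.get_dataset_shard("train")
    with strategy.scope():
        model = build_model(config)  # hypothetical model-building helper
    for _ in range(epochs):
        for batch in shard.iter_batches(batch_size=batch_size):
            pass  # convert the batch to tensors and train on it

# preprocessing_data / PARQUET_PATH come from the user's script.
config, dataset = preprocessing_data(PARQUET_PATH)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=3),
    datasets={"train": dataset},
)
result = trainer.fit()

The ScalingConfig is what determines how many CPUs end up reserved in the placement group shown in the autoscaler output above; num_workers=3 here is only a placeholder.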
Thank you so much! Could you also share the parquet file with me so that I can run it and debug?
@Alxe1 are you still running into this issue? If so can you provide a repro?
It hasn't appeared so far since I restarted the Ray cluster. But the ray::IDLE processes
sometimes are still there after the program is done, no matter what I run. #28199
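As a general note on the lingering processes, an explicit shutdown at the end of the driver script is the usual way to release what the driver started; the snippet below is a generic sketch, not code from this issue:

import ray

# ... training code ...

# Disconnect this driver from the cluster. If Ray was started implicitly by
# ray.init() in this process, this also tears down the Ray processes it launched.
ray.shutdown()

For a cluster started with ray start, running ray stop (or ray stop --force) on each node tears down all remaining Ray processes, including idle workers.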
Awesome! Will close for now.
What happened + What you expected to happen
I trained a model using Ray AIR, but it keeps running and doesn't stop:
Versions / Dependencies
ray 2.0.0, tensorflow 2.8.0, python 3.7.10
Reproduction script
Issue Severity
No response