Open martinrebane opened 4 years ago
PS! I do not know if this is relevant, but I am submitting the job with the tmux option:
ray submit train_gcp.yaml ray_trainer.py --tmux
cc @ijrsvt @henktillman
cc @martinrebane did you end up resolving this?
@richardliaw I solved it by adding export AUTOSCALER_MAX_NUM_FAILURES=100; to my cluster YAML file before the ray start commands (I added it for both the head and the workers). This works well for me, but for new users, discovering the problem and the solution might be rather frustrating.
So the error still pops up, but because the maximum number of failures is high, it does not kill the job.
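For reference, this is roughly what that looks like in the cluster YAML, following the head_start_ray_commands / worker_start_ray_commands layout from Ray's example GCP config (the exact flags below are illustrative, not copied from my file):

```yaml
# Illustrative sketch only. The export has to be part of the same list entry as
# `ray start`, since each command in the list runs in its own shell.
head_start_ray_commands:
    - ray stop
    - export AUTOSCALER_MAX_NUM_FAILURES=100; ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - export AUTOSCALER_MAX_NUM_FAILURES=100; ray start --address=$RAY_HEAD_IP:6379
```

As far as I understand, the value that actually matters is the one on the head node (where the monitor/autoscaler process runs), but setting it on the workers as well does no harm.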
https://github.com/googleapis/google-api-python-client/issues/218 this seems related? I don't see an obvious workaround though :S
What is the problem?
I am running a Ray cluster on Google Cloud Platform and using RaySGD to train my model. Every 30 minutes I see an error in the
ray monitor [yaml]
log: Error during autoscaling
(stack trace attached). The error itself does no harm and the cluster keeps working fine; nothing is scaled or killed. Finally, after AUTOSCALER_MAX_NUM_FAILURES
attempts, the error is followed by StandardAutoscaler: Too many errors, abort.
and all the worker nodes are killed.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.6, Ubuntu 18.04 on GCP, PyTorch 1.4, RaySGD TorchTrainer via a remote() call
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I am running my project using a remote RaySGD trainer:
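The script below is only a minimal, hypothetical sketch of that pattern (fake data, placeholder model, and creator functions roughly following the RaySGD docs of that era), not my actual training code:

```python
# Hypothetical sketch (not the real training script): a RaySGD TorchTrainer
# created and driven from a Ray remote function on an existing cluster.
import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer


def model_creator(config):
    return nn.Linear(1, 1)


def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=0.01)


def data_creator(config):
    # Fake data so the sketch has no external dependencies.
    x = torch.randn(256, 1)
    y = 2 * x + 1
    return DataLoader(TensorDataset(x, y), batch_size=32)


@ray.remote
def run_training(num_epochs=10):
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=nn.MSELoss,
        num_workers=2,
        use_gpu=False,  # set True if the worker nodes have GPUs
    )
    for _ in range(num_epochs):
        print(trainer.train())
    trainer.shutdown()


if __name__ == "__main__":
    # Connect to the running cluster started by the autoscaler.
    ray.init(address="auto")
    ray.get(run_training.remote())
```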
The error happens exactly every 30 minutes:
and then
The recurring
Error during autoscaling
itself does not seem to do any harm at all (all workers are doing their job), but the killing of the workers is of course very problematic, as new ones won't be created.
Full stack trace for the last autoscaling error and the error that causes the abort:
My cluster setup YAML:
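(A standard GCP provider block along these lines; the values here are placeholders, not my real project or region:)

```yaml
# Placeholder values only, not the real project or region.
min_workers: 2
max_workers: 2

provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-b
    project_id: my-gcp-project
```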