Open daviddavo opened 1 month ago
👏 This is really a feature in demand.
There is no official implementation right now, but I found a workaround for this issue.
First, add this timeout_decorator to your training function. Note that your training function needs to accept an additional reporter parameter: ray.train.report does not work correctly under multiprocessing, so a queue is used to pass the results back to the parent process.
from multiprocessing import Process, Queue

# Assumed alias: the original snippet does not show where ray_service comes from.
from ray import train as ray_service

def timeout_decorator(timeout):
    def decorator(func):
        def wrapper(*args, **kwargs):
            # Results travel from the child process to the parent through this queue.
            queue = Queue()

            def target():
                try:
                    kwargs['reporter'] = queue
                    func(*args, **kwargs)
                except Exception as e:
                    print(f"Function raised an exception: {e}")

            process = Process(target=target)
            process.start()
            # Wait at most `timeout` seconds for the training function to finish.
            process.join(timeout)
            if process.is_alive():
                print("Function timed out! Terminating process and reporting 0.")
                process.terminate()
                process.join()
                ray_service.report({'accuracy': 0})
            # Forward everything the child reported to Ray from the parent process.
            while not queue.empty():
                ray_service.report(queue.get())
        return wrapper
    return decorator
Second, set the start method at the top of the script. Note that you must not initialize torch in your main thread; otherwise you will get a CUDA re-initialization error.
from multiprocessing import set_start_method
set_start_method('spawn', force=True)
Finally, replace ray.train.report in your training function with a queue operation:
reporter.put(results_dict)
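For illustration, here is a minimal sketch of a training function adapted this way. The names `train_fn`, `drain`, and `config`, as well as the metric values, are hypothetical and not part of the workaround above. `queue.Queue` stands in for the `multiprocessing.Queue` that the decorator injects, since both expose `put`, `get`, and `empty`.

```python
import queue


def train_fn(config, reporter=None):
    """Hypothetical training loop that reports through a queue."""
    for epoch in range(config["epochs"]):
        # Placeholder metric; a real loop would compute this from validation data.
        accuracy = (epoch + 1) / config["epochs"]
        # Instead of calling ray.train.report(...), push the metrics dict to the queue.
        reporter.put({"epoch": epoch, "accuracy": accuracy})


def drain(reporter):
    """Collect everything the training function reported, in order."""
    results = []
    while not reporter.empty():
        results.append(reporter.get())
    return results


# Stand-in for the queue the decorator would pass as `reporter`.
q = queue.Queue()
train_fn({"epochs": 4}, reporter=q)
reported = drain(q)
print(reported[-1])
```

In the real setup the parent process would forward each drained dict to Ray instead of collecting them in a list.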
I hope this is helpful😃.
Description
I know that you can use a TrialStopper to stop a trial when it reaches some condition (when the loss reaches a plateau, or the number of iterations is too large). But sometimes, due to an error, the trial might hang. Even if you specify that you want to stop the trial after a certain number of seconds, it won't stop, because the function never returns. You should be able to specify a "hard timeout" for trials where the scheduler would kill the process.
Note: I know that you can use signals or another process, but using another process would multiply the number of processes per trial, and using signals is OS-dependent.
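To make the requested behaviour concrete, here is a rough sketch of a hard timeout enforced from outside the hanging code. It is not how Ray Tune schedules trials; it only uses the standard library, and the helper name `run_with_hard_timeout` and the code strings are hypothetical. The point is that the parent enforces the deadline, so it still works when the child never returns.

```python
import subprocess
import sys


def run_with_hard_timeout(code, timeout_s):
    """Run `code` in a fresh interpreter and kill it if it exceeds `timeout_s`.

    Returns True if the child exited on its own with status 0, and False if
    it was killed for exceeding the timeout or exited with an error.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


# A "trial" stuck in an infinite loop is killed after one second.
hung = run_with_hard_timeout("while True: pass", timeout_s=1.0)
# A well-behaved "trial" finishes long before its deadline.
finished = run_with_hard_timeout("print('done')", timeout_s=30.0)
```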
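For reference, the signal-based alternative might look like the sketch below, which shows why it is OS-dependent: `signal.SIGALRM` and `signal.setitimer` are only available on Unix, and the handler can only be installed from the main thread. The name `time_limit` is hypothetical. Because Python runs signal handlers between bytecode instructions, a hang inside a long-running C call may not be interrupted by this approach.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError in the main thread after `seconds` (Unix only)."""
    if not hasattr(signal, "SIGALRM"):
        raise NotImplementedError("SIGALRM is not available on this platform")

    def on_alarm(signum, frame):
        raise TimeoutError(f"exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore whatever handler was installed before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


timed_out = False
try:
    with time_limit(0.2):
        time.sleep(5)  # stands in for a stuck training step
except TimeoutError:
    timed_out = True
```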
Use case
Better resource usage. Don't leave my program running all night when a trial has entered an infinite loop or something.