ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Temporary folder per worker process #45108

Open j-tr opened 6 months ago

j-tr commented 6 months ago

Description

Ray should create a temporary directory for each worker process and delete it after the process terminates. The path to this directory could be communicated to the application code via an environment variable.

Use case

Many workloads use temporary directories and files, which the application logic usually cleans up once they are no longer needed (e.g. using the Python tempfile module). If the workload is terminated abruptly (e.g. if the Ray memory monitor or the OS kills the underlying process), this cleanup cannot run and the temporary files stay around indefinitely. Over time this can fill up the disk and make the node unusable. Application developers should not be responsible for running additional maintenance tasks to clean up files left behind by unexpectedly terminated tasks.
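The failure mode described above can be reproduced without Ray. The following is a minimal sketch using only the Python standard library: a child process creates a directory with tempfile.TemporaryDirectory and is then killed before the context manager exits, so the directory is left behind. The helper name leaked_dir_after_kill and the child script are illustrative assumptions, not part of Ray.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child script: creates a temp dir under the given parent, reports its path,
# then sleeps so the parent can kill it before the `with` block exits.
CHILD = textwrap.dedent("""
    import sys, tempfile, time
    with tempfile.TemporaryDirectory(dir=sys.argv[1]) as d:
        print(d, flush=True)
        time.sleep(60)
""")


def leaked_dir_after_kill(parent_dir):
    """Start the child, kill it abruptly, and report whether its temp dir survived."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, parent_dir],
        stdout=subprocess.PIPE,
        text=True,
    )
    child_dir = proc.stdout.readline().strip()  # path printed by the child
    proc.kill()   # SIGKILL on POSIX: the child gets no chance to clean up
    proc.wait()
    proc.stdout.close()
    return child_dir, os.path.isdir(child_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        path, leaked = leaked_dir_after_kill(parent)
        print(f"{path} still exists after kill: {leaked}")
```

A kill from the memory monitor or the OS OOM killer behaves like proc.kill() here: no Python-level cleanup handler runs in the killed process.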

anyscalesam commented 6 months ago

@jjyao thoughts? Would this be a big change?

jjyao commented 6 months ago

Hi @j-tr, you can specify the temp dir when you start Ray, so your application can do the clean-up as you want.

j-tr commented 6 months ago

@jjyao could you please elaborate? Are you referring to the _temp_dir parameter of ray.init? This ticket aims to tackle the challenge of clearing temporary files when the application cannot do so itself due to termination by the operating system or memory monitor.

jjyao commented 6 months ago

@j-tr

Just want to make sure I understand your feature request: are you saying that Ray should create a unique temporary folder for each worker process and clean it up when the worker process exits?

Also, can you tell us more about why you want to use a temporary folder?

j-tr commented 6 months ago

@jjyao exactly: a temporary folder per worker process, managed by Ray. Temporary files are often part of our data processing pipeline when streaming data directly to and from the cloud is not easily possible (e.g. download from the cloud to a temporary file, process the data, store the output in another temporary file, upload to the cloud).
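The behaviour requested in this thread can be approximated outside Ray by letting the supervising process own the directory, since the supervisor survives when the worker is killed. The sketch below is only an illustration under stated assumptions: the environment variable name WORKER_TMPDIR and the helper run_worker_with_tmpdir are hypothetical, and this is not how Ray manages worker processes.

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical variable name for this sketch; Ray does not define it.
ENV_VAR = "WORKER_TMPDIR"


def run_worker_with_tmpdir(cmd, base_dir=None, kill_after=None):
    """Run `cmd` with a private temp dir and delete the dir once the process is gone.

    `kill_after` (seconds) simulates an abrupt termination such as an OOM kill.
    Returns the (now deleted) directory path and the worker's return code.
    """
    worker_tmp = tempfile.mkdtemp(prefix="worker-", dir=base_dir)
    env = dict(os.environ)
    env[ENV_VAR] = worker_tmp  # tell the worker where its scratch space is
    proc = subprocess.Popen(cmd, env=env)
    try:
        if kill_after is not None:
            try:
                proc.wait(timeout=kill_after)
            except subprocess.TimeoutExpired:
                proc.kill()  # abrupt kill: the worker cannot clean up itself
        returncode = proc.wait()
    finally:
        if proc.poll() is None:  # make sure the worker is gone before deleting
            proc.kill()
            proc.wait()
        # Cleanup happens in the supervisor, so it runs however the worker died.
        shutil.rmtree(worker_tmp, ignore_errors=True)
    return worker_tmp, returncode


if __name__ == "__main__":
    worker_code = (
        "import os, time\n"
        "tmp = os.environ['WORKER_TMPDIR']\n"
        "open(os.path.join(tmp, 'download.bin'), 'wb').close()\n"
        "time.sleep(60)\n"
    )
    path, rc = run_worker_with_tmpdir([sys.executable, "-c", worker_code], kill_after=1.0)
    print(f"return code {rc}, directory exists: {os.path.exists(path)}")
```

The worker only reads the environment variable and writes its intermediate files there. It needs no cleanup logic of its own, which is the property the feature request asks Ray to provide.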