Open aaaditij opened 2 months ago
Hi @aaaditij. Thank you for creating this enhancement request. These options make sense to me as traitlet-configurable options. At the same time, we are not working on any of them now and are not planning to, as there are some high-priority deliverables in the pipeline.
Implementation overview for "1.) Add a boolean flag to download_files api to allow the user to specify if they only want the output files to be copied over to the output folder." based on discussion with @aaaditij:
- Add a `side_effects: Optional[List[str]] = []` field to all Job-related data models: https://github.com/jupyter-server/jupyter-scheduler/blob/main/jupyter_scheduler/models.py
- Use `side_effects` to store side effect files created during the job run, instead of adding them to `packaged_files`, by changing `DefaultExecutionManager.add_side_effects_files` accordingly: https://github.com/jupyter-server/jupyter-scheduler/blob/1af9903a0671f2dfa570959bd097e6af492d9be0/jupyter_scheduler/executors.py#L147
- Update `FilesDownloadHandler` to accept an output-files-only option: https://github.com/jupyter-server/jupyter-scheduler/blob/1af9903a0671f2dfa570959bd097e6af492d9be0/jupyter_scheduler/handlers.py#L397
- To pass the new options to `JobFilesManager` and `Downloader` as parameters, add them as arguments to both and to any intermediary classes: https://github.com/jupyter-server/jupyter-scheduler/blob/1af9903a0671f2dfa570959bd097e6af492d9be0/jupyter_scheduler/job_files_manager.py#L27-L34
- Update the `Downloader.generate_filepaths` function to return only side effects and outputs, and not packaged files, if `download_output_files_only` is set: https://github.com/jupyter-server/jupyter-scheduler/blob/1af9903a0671f2dfa570959bd097e6af492d9be0/jupyter_scheduler/job_files_manager.py#L56-L70
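To make the overview above more concrete, here is a minimal sketch of the selection logic, not the actual jupyter-scheduler code. `JobFiles` and `select_files_to_download` are hypothetical names made up for illustration; only `side_effects`, `packaged_files`, and `download_output_files_only` come from the discussion. It uses a plain dataclass so it runs without extra dependencies, which is why the list default is written with `default_factory` rather than `= []`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobFiles:
    """Hypothetical stand-in for a Job-related data model (not the real class)."""

    # Files produced by the job itself, e.g. the executed notebook.
    output_files: List[str] = field(default_factory=list)
    # Input-folder files that were packaged when the job was created.
    packaged_files: List[str] = field(default_factory=list)
    # Proposed field: files the notebook created while it was running.
    side_effects: Optional[List[str]] = field(default_factory=list)


def select_files_to_download(
    job: JobFiles, download_output_files_only: bool = False
) -> List[str]:
    """Return the relative paths that a download should copy."""
    # Outputs and side effects are always included.
    files = list(job.output_files) + list(job.side_effects or [])
    # Packaged input files are skipped when only outputs are requested.
    if not download_output_files_only:
        files += job.packaged_files
    return files
```

With the flag set, a job whose input folder was packaged would download only its outputs and side effect files; with the flag unset, this sketch still returns the packaged files as well.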
Problem
For the createjob api, one of the inputs is a boolean flag called package_input_folder which, when set to true, packages the input folder (the folder containing the input notebook) and all nested files and subfolders within it during job creation. This introduces the following problems:
1.) The `download_files` api copies the entire input folder from the staging area to the output folder. This is currently done so that the notebook downloaded with the other output files has access to the same files as the original, and so that running the notebook as a whole, or only some cells, can be replicated if they refer to files via local paths. In essence, the entire input folder is copied twice, once to the staging area and then to the output folder, which can quickly lead to storage exhaustion if the input folder is large.
2.) The files in the staging area are never cleaned up, again eating up storage space.
Proposed solution
1.) Add a boolean flag to the `download_files` api to allow the user to specify if they only want the output files to be copied over to the output folder.
2.) Add an option to the `download_files` api to delete all files belonging to an execution from the staging area after they have been copied over to the output folder.
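As a rough illustration of how the two proposals could fit together, the sketch below copies a chosen set of files from a staging folder to an output folder and optionally removes the staging folder afterwards. It assumes both locations are local directories, which may not hold for the real staging area. The function name `copy_from_staging` and the `delete_staged_files` parameter are hypothetical and are not part of the existing API.

```python
import shutil
from pathlib import Path
from typing import Iterable


def copy_from_staging(
    staging_dir: Path,
    output_dir: Path,
    relative_paths: Iterable[str],
    delete_staged_files: bool = False,
) -> None:
    """Copy the selected files to the output folder, then optionally clean up staging."""
    for rel in relative_paths:
        src = staging_dir / rel
        dst = output_dir / rel
        # Recreate nested folders so relative paths stay valid in the output.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    # Reached only if every copy succeeded: a failed copy raises first,
    # which leaves the staged files in place for a retry.
    if delete_staged_files:
        shutil.rmtree(staging_dir)
```

Passing only the output and side effect paths as `relative_paths` covers the first proposal, and `delete_staged_files=True` covers the second. Deleting only after all copies succeed is a deliberate choice in this sketch so that a partial download does not lose data.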