tapis-project / tapis-jobs

Texas Advanced Computing Center APIs
BSD 3-Clause "New" or "Revised" License

Support "multi-step" applications #20

Open joestubbs opened 2 years ago

joestubbs commented 2 years ago

Some research workloads are composed of multiple, individual steps submitted as a single HPC batch job. In the simplest cases, the entry point to the job is a parent script that starts up the individual steps as threads or processes. These individual steps could run sequentially or in parallel, on the same compute node or across additional nodes.

For various reasons, it can be more computationally efficient to submit the workload as a single batch job -- for example, if the individual steps will share memory or files that need to be staged to/from the compute environment. Additionally, some individual steps do not make sense to run "standalone" or without having executed previous steps. For these reasons, it would be ideal for the "multi-step" application to be registered as a single Tapis application.

There are many tools that support developers building and executing multi-step applications. Here are just a few:

  1. Launcher (and GPU Launcher) [ref] is a tool developed at TACC that allows users to execute a list of tasks (command lines) across 1 or more nodes.
  2. CWL [ref] is a workflow specification language with implementations that support executing workflows of command-line tools.
  3. Nextflow [ref] is a framework for building data-driven computational pipelines.

There are at least two challenges in making applications like the above work with Tapis apps.

  1. Programs like Launcher need to be executed directly on the compute node. This is similar to MPI launchers like mpirun. In Launcher's case in particular, a single input text file (the task file) should be supplied. The task file contains the list of command lines to execute across the resources in the job. It is not uncommon for applications to dynamically generate the task file for a particular job. For example, a task file might be generated based on the number of input files to the job (e.g., one task line for each input file). It's also not uncommon for preprocessing to occur before executing Launcher -- for example, to split large input files into smaller files -- and to run post-processing after it completes.
  2. It is not clear whether it is possible to execute Singularity containers from within Singularity containers on TACC HPC machines. One approach to handling some multi-step applications could be to write a driver script that is responsible for launching the individual steps and wrap that into a container that includes the Singularity binary plus Singularity image files for the individual steps. This "driver image" would become the Tapis application image, and when executed, it would start the driver script, which could launch the individual steps as standalone containers (e.g., using "singularity run" or "singularity exec"). While this is possible to do in general, we have not been able to make it work on TACC HPC machines.

Even if executing containers from within running containers can be made to work, it is not clear how that approach would help with cases like Launcher.
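
To make the task-file case from challenge 1 concrete, here is a minimal sketch of generating one task line per input file. The function names, the `./process.sh` command, and the `*.fastq` pattern are hypothetical placeholders, not part of Launcher or Tapis; the only assumption taken from the description above is that the task file is plain text with one command line per task.

```python
import shlex
from pathlib import Path


def build_task_lines(input_files, command="./process.sh"):
    """Return one command line per input file (one task each)."""
    lines = []
    for path in sorted(input_files):
        # Quote the path so file names containing spaces survive the shell.
        lines.append(f"{command} {shlex.quote(str(path))}")
    return lines


def write_task_file(input_dir, task_file, pattern="*.fastq", command="./process.sh"):
    """Write a task file with one line per file in input_dir matching pattern.

    Returns the number of task lines written.
    """
    inputs = Path(input_dir).glob(pattern)
    lines = build_task_lines(inputs, command)
    # One task per line, each line terminated by a newline.
    Path(task_file).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

A wrapper script would call something like `write_task_file("inputs", "tasks.txt")` after any preprocessing and before starting Launcher, which is exactly the part that has to run on the compute node rather than inside a single container command.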

schristley commented 1 year ago

I believe it's possible to run Docker within Docker because Docker runs with root privileges, but Singularity seems to disallow this by design.

schristley commented 1 year ago

Another idea that doesn't require implementing a workflow language is to let the user run consecutive jobs that share the same scratch directory. That is, run a job and leave the scratch directory and all of its files in place; then, when running a second job, specify the same scratch directory so the second job has access to all the files left over from the first job, and so forth.

I haven't looked deeply enough into v3/jobs to see if this is currently possible. I suppose the user could specify the scratch directory to be used instead of it being automatically determined by Tapis.

joestubbs commented 11 months ago

@richcar58 : Can you add a pointer here to the zip runtime proposal/implementation documentation?

richcar58 commented 11 months ago

Yes, it's possible to arrange for the output files of one Tapis v3 job to be used as input into one or more subsequent jobs. The trick will be settling on a job launch strategy--when will jobs later in a workflow's execution get launched? Those later jobs, for example, could get launched upon receipt of a Tapis notification that a previous job has completed. This orchestrator approach requires workflow knowledge to be externalized and really is a type of rudimentary workflow manager.
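
As a rough illustration of the orchestrator approach, the sketch below keeps the workflow knowledge in a dependency map and launches a step once every step it depends on has finished. Everything here is hypothetical: the `Orchestrator` class, the step names, and the `submit` callable stand in for whatever code would actually submit a Tapis job and receive its completion notification.

```python
class Orchestrator:
    """Launch each step once all the steps it depends on have finished."""

    def __init__(self, dependencies, submit):
        # dependencies maps a step name to the names of the steps it needs.
        self.dependencies = {step: set(deps) for step, deps in dependencies.items()}
        # submit is a caller-supplied function that launches one step.
        self.submit = submit
        self.finished = set()
        self.launched = set()

    def start(self):
        """Launch every step that has no dependencies."""
        self._launch_ready()

    def on_job_finished(self, step):
        """Call this when a completion notification for a step arrives."""
        self.finished.add(step)
        self._launch_ready()

    def _launch_ready(self):
        for step, deps in self.dependencies.items():
            # A step is ready when all of its dependencies are finished.
            if step not in self.launched and deps <= self.finished:
                self.launched.add(step)
                self.submit(step)
```

The dependency map is the externalized workflow knowledge mentioned above; a real orchestrator would also need to handle failed jobs and lost notifications, which this sketch ignores.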

Another approach would be to bake into the jobs themselves a monitoring or polling capability. A simple monitoring approach would have jobs wait for particular files to appear (or disappear, become unlocked, etc.) before proceeding. In this scenario, all jobs can be launched simultaneously and each would only start executing when its inputs became available.
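
A minimal sketch of the file-polling variant, assuming each job can run a small check before its real work begins. The function name and the default intervals are illustrative only.

```python
import time
from pathlib import Path


def wait_for_files(paths, poll_seconds=5.0, timeout_seconds=3600.0):
    """Block until every path exists.

    Returns True once all paths exist, or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_seconds
    pending = [Path(p) for p in paths]
    while True:
        # Keep only the paths that have not appeared yet.
        pending = [p for p in pending if not p.exists()]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

One caveat with this design: a file can exist before its writer has finished with it, so an upstream job would typically write a separate marker file (or rename the output into place) as its last action, and downstream jobs would wait on that instead. Jobs that are waiting also occupy their allocation while they poll.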

richcar58 commented 11 months ago

Currently under development is support for a new Tapis runtime called the Zip Runtime. This support allows zip or tar.gz archive files to be treated as a type of "image" in application definitions and in the Jobs service. Jobs will stage the archive file, unpack it, execute a specified executable and monitor that executable until it reaches a terminal state. Input file staging and output file archiving work the same as in all other Tapis supported runtimes.

The idea is that users will have complete freedom to include whatever they need in their archives and can run whatever commands their host account permits. Certain conventions need to be observed to interact successfully with Tapis, but other than that workflows can be encapsulated in an archive. Executions can be reproducible by versioning and documenting the archive files.
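
The unpack-then-execute flow can be pictured with the sketch below. It is not the Jobs service implementation: the function name, the `entry_point` parameter, and the choice to run the entry point with `sh` are all assumptions made for illustration, and it simply waits for the process to exit rather than monitoring it through a scheduler.

```python
import shutil
import subprocess
from pathlib import Path


def unpack_and_run(archive, work_dir, entry_point):
    """Unpack a zip or tar.gz archive and run a shell script found inside it.

    Returns the script's exit code.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive), str(work_dir))
    script = work_dir / entry_point
    if not script.is_file():
        raise FileNotFoundError(f"{entry_point} not found in {archive}")
    # Assumption: the entry point is a shell script. Running it through sh
    # avoids depending on executable permission bits, which zip extraction
    # does not always preserve.
    completed = subprocess.run(["sh", entry_point], cwd=work_dir)
    return completed.returncode
```

Because the archive can hold any mix of scripts, task files, and binaries, a multi-step workflow such as a Launcher run fits inside it as long as the entry point drives the individual steps.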

schristley commented 11 months ago

The ZIP Runtime looks good and is somewhat like V2, which took a directory tree from a storage system, then zipped and versioned it when an app was published. If I understand the design, the app entry point will be a Bash script that is outside of a Singularity container?

richcar58 commented 11 months ago

By default, tapisjob_app.sh will be run, and it typically would be a Bash script. Otherwise, the tapisjob.manifest file would contain the pathname of the executable to be run (any executable will do).