ireceptor-plus / issues


commonality between TACC systems and CC systems #84

Closed schristley closed 1 year ago

schristley commented 3 years ago

reproducibility requires a common environment at some point, but as you try to define it, you find it is turtles all the way down

@bcorrie As you mentioned the other day, what's in "common" wasn't necessarily common, at least when it came to the Cedar system. One main issue I know is TACC has a tool called launcher that is a simple parallelism tool, you put commands in a text file, one line each, and it runs them all in parallel. VDJServer uses that extensively. Do you have a similar tool available on Cedar?
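For reference, a typical launcher setup on TACC looks roughly like the sketch below (the module name, `LAUNCHER_JOB_FILE`, and `paramrun` follow the launcher docs, but exact details may differ by system; the commands in the job file are hypothetical):

```shell
# commands.txt -- one independent command per line
#   bash mutational_analysis.sh germ1.fasta base1
#   bash mutational_analysis.sh germ2.fasta base2

# job.slurm -- sketch of a launcher job script (TACC-style)
#SBATCH -N 1
#SBATCH -n 8
module load launcher
export LAUNCHER_JOB_FILE=commands.txt
$LAUNCHER_DIR/paramrun   # runs each line of commands.txt in parallel
```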

bcorrie commented 3 years ago

@schristley not that I am aware of, but there might be. CC seems to suggest using SLURM's job arrays for something like that (https://docs.computecanada.ca/wiki/Running_jobs#Array_job)

schristley commented 3 years ago

> @schristley not that I am aware of, but there might be. CC seems to suggest using SLURM's job arrays for something like that (https://docs.computecanada.ca/wiki/Running_jobs#Array_job)

Hmm, interesting. TACC also uses SLURM but doesn't seem to support job arrays; they might work, though. One disadvantage is that you need to know the number of jobs ahead of time. I also don't know how you would set that flag with Tapis...
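For comparison, a job-array version of the same commands-in-a-file pattern might look like this sketch (note the array size is fixed up front, which is the disadvantage mentioned above):

```shell
#!/bin/bash
#SBATCH --array=1-8          # must know the job count ahead of time

# run the N-th line of commands.txt for this array task
CMD=$(sed -n "${SLURM_ARRAY_TASK_ID}p" commands.txt)
eval "$CMD"
```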

launcher is open source, so in theory you could use it; it is small enough that you can put the whole package in the app assets. However, I'm not sure launcher was written generically enough; it wouldn't surprise me if there are TACC'isms in it somewhere.

schristley commented 3 years ago

What I've started doing is writing individual self-contained shell scripts to perform operations, they are designed to be run within singularity like this:

singularity exec ${singularity_image} bash mutational_analysis.sh ${germFilename} ${baseFilename}

Hopefully this makes them fairly self-contained, then at a higher-level script you can decide how you want to parallelize these across nodes and files.
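At the higher level, the fan-out can stay very simple; here is a minimal sketch (`run_all` and the one-job-per-file policy are hypothetical, not part of the repo):

```shell
#!/bin/bash
# Run one command per input file in parallel, then wait for all to finish.
# Usage: run_all "<command prefix>" file1 file2 ...
run_all() {
  local cmd="$1"; shift
  for f in "$@"; do
    $cmd "$f" &     # launch each file's job in the background
  done
  wait              # block until every background job completes
}

# e.g. (hypothetical image/script names):
# run_all "singularity exec immcantation_suite-4.1.0.sif bash mutational_analysis.sh" germ1.fasta germ2.fasta
```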

bcorrie commented 3 years ago

This sounds good to me... I think for parallelism we probably want to keep it simple for now. I see two options:

  1. The singularity container handles parallelism internally (with multithreading) so you ask Tapis for N cores and then tell the container to use N cores, or
  2. The singularity container is single threaded, and you use an external, platform specific mechanism (launcher on TACC, array jobs on CC) for "embarrassingly parallel" jobs, asking Tapis for N cores and running N jobs.
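Option 1 can be as simple as forwarding the core count into the container, along these lines (a sketch: `AGAVE_JOB_PROCESSORS_PER_NODE` is the Agave/Tapis wrapper-template variable as I understand it, and `--threads` stands in for whatever flag the tool actually takes):

```shell
# inside app.sh (sketch): ask Tapis for N cores, pass N to the tool
NCORES=${AGAVE_JOB_PROCESSORS_PER_NODE:-1}
singularity exec ${singularity_image} mytool --threads ${NCORES} input.tsv
```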
bcorrie commented 3 years ago

@schristley wondering if we could agree on a consistent file name for the Tapis App JSON file for all Apps. Each App has a different directory (e.g. assign_clones). In that directory, there is a different directory for each version, and within that a different directory for each system. In this case, the main JSON file is assign-clones.json and the main shell script is assign-clones.sh.

Would it make sense for all Apps to have a consistently named JSON description file (e.g. app.json) and a consistent basic shell script (app.sh)? That way, for any iR+ App, we know that the two key files are app.json and app.sh. We could go further and, rather than have a test directory with a test.sh in it, have a test.sh at the higher level.

This makes it easy to automatically look at the App directory and infer which apps exist and makes it clear which are the basic JSON and shell files that Tapis requires... That way we know we can always do a

tapis app create -f app.json

and know that within that file we would always have:

  "templatePath": "app.sh",
  "testPath": "test.sh",

All other files are staged, so app.sh can do what ever it wants, but this way we have consistency across all iR+ apps for all systems...

bcorrie commented 3 years ago

Another possible alternate protocol we could use is use the base app directory name as both the JSON file name and the shell file name. This is essentially what you have done, although not sure if this is explicit and consistent.

Either would be fine for me, but if we always knew that one of these was true, then processing Apps would be easier...
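With a fixed convention, discovering apps reduces to a directory walk; a minimal sketch, assuming the `apps/<name>/<version>/<system>/app.json` layout described above:

```shell
#!/bin/bash
# List every deployable app directory by looking for the agreed-upon app.json.
list_apps() {
  local json
  for json in apps/*/*/*/app.json; do
    [ -e "$json" ] || continue      # glob may match nothing
    echo "${json%/app.json}"        # strip the filename, keep the directory
  done
}

# Each printed directory could then be deployed with, e.g.:
#   (cd "$dir" && tapis app create -f app.json)
```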

bcorrie commented 3 years ago

I am currently using app.json and app.sh for all Apps on the Gateway, but it would be easy to change...

schristley commented 3 years ago

> Would it make sense for all Apps to have a consistently named JSON description file (e.g. app.json) and a consistent basic shell script (app.sh). In that way, for any iR+ App, we know that the two key files are app.json and app.sh.

That seems reasonable. I don't see any problem with that.

If you are thinking of parsing app.json, you might consider getting the JSON from the Tapis API instead. That way you get the JSON for the actual app (and any of its different published versions), for example:

tapis apps list
tapis apps show -v repcalc-stampede2-1.0u7
tapis apps show -v repcalc-stampede2-1.0u6

Note, that for production, the app should be published (tapis apps publish) so that the app is bundled and frozen. Calling the non-published apps (e.g. repcalc-stampede2-1.0) is fine for testing of course.

> We could go further and rather than have a test directory with a test.sh in it, have a test.sh at the higher level.

I prefer to have the test directory, so that all test job scripts are kept separate from the main app files. I also tend to upload the test JSON as part of the app package so they can be recovered if necessary; if they were in the main directory, they would be put in the main SCRATCH directory when the job is staged, which pollutes that directory with extra files. Some of my apps have a bunch of test JSON. But if you really don't intend to have those scripts, or have them somewhere else, then I don't see a problem.

As far as I know, Tapis doesn't implement the testing aspect of apps, which is why test.sh is always empty.

schristley commented 3 years ago

@bcorrie One thing I've been playing with is to make the singularity_image input have a default value and visible:false like below. Then when a job is submitted, it doesn't have to be provided as part of the job submission but is still copied as part of job staging.

    {
      "id": "singularity_image",
      "details": {
        "label": "",
        "description": "Singularity image file",
        "showAttribute": false
      },
      "semantics": {
        "minCardinality": 1,
        "maxCardinality": 1,
        "ontology": [
          "http://sswapmeet.sswap.info/mime/application/Json"
        ],
        "fileTypes": [
          "text-0"
        ]
      },
      "value": {
        "default": "agave://data.vdjserver.org//irplus/images/immcantation_suite-4.1.0.sif",
        "visible": false,
        "required": true
      }
    },
bcorrie commented 3 years ago

Cool... That is almost exactly what I did - except I use a parameter and have the App actually pull the image from the Gateway.

https://github.com/ireceptor-plus/tapis/blob/fb7bf23ca04941ed706b28da9ccc1c8b07e6ec4c/apps/vdjbase-singularity/0.1/gateway/app.sh#L60

I had horrible performance problems when I asked Tapis to do this staging as part of the inputs and as part of the App itself. I lodged an issue with TACC and they are looking into it, but no news back yet. So I decided to use a parameter and have the App pull the image.

I think your approach is the ideal one.
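For the record, the pull-it-yourself workaround amounts to something like the following in app.sh (a sketch; the image name and URL are placeholders, not the Gateway's actual paths):

```shell
# Fetch the singularity image ourselves instead of listing it as a Tapis input.
IMAGE="immcantation_suite-4.1.0.sif"                      # placeholder name
IMAGE_URL="https://gateway.example.org/images/${IMAGE}"   # placeholder URL
if [ ! -f "${IMAGE}" ]; then
    curl -fSL -o "${IMAGE}" "${IMAGE_URL}"
fi
singularity exec "${IMAGE}" bash mutational_analysis.sh ...
```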

schristley commented 3 years ago

> I had horrible performance problems when I asked Tapis to do this staging as part of the inputs and as part of the App itself. I lodged an issue with TACC and they are looking into it, but no news back yet. So I decided to use a parameter and have the App pull the image.

Hmm, I have a sneaky suspicion that the copy is going through a Tapis server at TACC... You might want to specifically ask if that is the case, or maybe you already know? While the network backbone between TACC and CC should be super fast, we have seen those slow network issues in the past with the ADC stuff...

bcorrie commented 3 years ago

Yes, it appears that it is, and one of the connections is quite slow, so it looks like it is not going over the research network. I suggested to TACC that this might be the problem. No info back yet, so I'm sticking with my parameter solution for now.

schristley commented 1 year ago

Will not be implemented; the two systems are different, and Tapis V3 requires a re-think of apps.