Add support for prefixing calls to HTCondor binaries to the HTCondor runner

kysrpex commented 10 months ago

This PR is meant to be used for the HTCondor migration. It defines an optional parameter prefix for the HTCondor job runner that prepends the parameter value to all calls to HTCondor binaries, such as condor_submit, condor_rm or condor_ssh_to_job.

This enables having two HTCondor runners defined in the Galaxy job configuration file, with each one calling different executable files and thus allowing to route jobs to two different HTCondor clusters.

The shortcoming of this approach is that the contents of prefix are constrained to strings that, when prepended to the condor_* commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the subprocess.Popen constructor (using shell=False).

How to test the changes?

(Select all options that apply)

[ ] I've included appropriate automated tests.
[ ] This is a refactoring of components with existing test coverage.
[X] Instructions for manual testing are as follows:
1. See this minimal example.

License

[x] I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

kysrpex commented 10 months ago

Do you think we should invest more time in a better solution and/or something that is really upstreamable? Or is this ok to complete the migration?

kysrpex commented 10 months ago

@sanjaysrikakulam This is the same patch to the Galaxy code we have just tested.

bgruening commented 10 months ago

Wouldn't it be better to have a overwrite per command?

docker_docker_cdocker_odocker_cdocker_c

e.g. condor_rm_cmd and condor_queue_cmd ...

kysrpex commented 10 months ago

Wouldn't it be better to have a overwrite per command?

docker_docker_cdocker_odocker_cdocker_c

e.g. condor_rm_cmd and condor_queue_cmd ...

@bgruening So the answer to

Do you think we should invest more time in a better solution and/or something that is really upstreamable?

is yes?

bgruening commented 10 months ago

Sorry I wanted to paste this link above: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/job_conf.xml.sample_advanced#L579

bgruening commented 10 months ago

I don't know please discuss this internally. I assume using condor_rm_cmd makes the patch here and your ansible easier and cleaner.

kysrpex commented 10 months ago

The last commit adds the three optional parameters condor_rm_cmd, condor_ssh_to_job_cmd, condor_submit_cmd that you requested (the runner does not call any other condor commands).

In addition and more importantly, it also makes the runner invoke everything via shell, so

The shortcoming of this approach is that the contents of prefix are constrained to strings that, when prepended to the condor_* commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the subprocess.Popen constructor (using shell=False).

no longer applies and one can do tricks like this (this is what I did to test it),

runners:
  condor:
    load: galaxy.jobs.runners.condor:CondorJobRunner
  condor_secondary:
    load: galaxy.jobs.runners.condor:CondorJobRunner
    prefix: "touch /tmp/prefix_works; "
    condor_rm_cmd: "touch /tmp/condor_rm_cmd_works; condor_rm"
    condor_ssh_to_job_cmd: "touch /tmp/condor_ssh_to_job_cmd_works; condor_ssh_to_job"
    condor_submit_cmd: "touch /tmp/condor_submit_cmd_works; condor_submit"

which definitely makes the Ansible playbooks cleaner, because then systemd-run can be set as prefix.

kysrpex commented 10 months ago

Additionally, while testing this I found out that there is a bug in Galaxy that prevents it from stopping and killing Docker containers. Thus, whenever a user clicks the trash icon to remove a running Docker job, the container is not stopped. The error manifests in the logs as follows,

Nov 03 10:**:** sn06.galaxyproject.eu python[2314207]: galaxy.jobs.runners.condor WARNING 2023-11-03 10:**:**,*** [pN:handler_sn06_*,p:2314207,tN:JobHandlerStopQueue.monitor_thread] stop_job(): ********: trying to kill container failed. ('commands')

and this is the line of code that triggers it. The reason is a KeyError ("commands" key). cont.container_info is an instance of galaxy.model.custom_types.MutationDict, but printing cont.container_info shows {} (an empty dictionary).

I will report the issue on the Galaxy issue tracker.

bgruening commented 10 months ago

Have you read https://docs.python.org/3.7/library/subprocess.html#security-considerations and why we should not use shell=True? Are you sure you can not get the same results with shell=False?

kysrpex commented 10 months ago

Have you read https://docs.python.org/3.7/library/subprocess.html#security-considerations and why we should not use shell=True? Are you sure you can not get the same results with shell=False?

That's a very good point. Let's look at what are the possible ways to take advantage of shell injection here:

condor_submit: {prefix}{command} {submit_file}. We configure the prefix and the command. Can submit_file be malicious? That's a string with a path on it, crafted by Galaxy.
condor_rm: {prefix}{command} {external_id}. external_id is generated by HTCondor.
condor_ssh_to_job: f"{self.runner_params.get('prefix', '')}{self.runner_params.get('condor_ssh_to_job_cmd')} {external_job_id} {command}. Again, external_job_id is generated by HTCondor. command comes from job_wrapper.get_job().container.container_info["commands"][command]. job_wrapper.get_job() yields a model.Job, so I guess container is what is stored on the job container association table. Looking at psql that seems to be a JSON blob. I have no idea what could be in there. I would have to spend a few hours diving into the Galaxy code to find out with certainty, or alternatively (although less reliably), reading examples from the database.

I think the conclusion is, if we trust Galaxy developers and HTCondor developers, then the two first points are ok. If we trust the tools we install, then the third point is probably also ok. We are not dealing with untrusted inputs here.

kysrpex commented 10 months ago

If there are no further objections, let's move on?

bgruening commented 10 months ago

Go for it ... but please try to get those changes or similar ones also upstream for the next release.

usegalaxy-eu / galaxy

Add support for prefixing calls to HTCondor binaries to the HTCondor runner #202

How to test the changes?

License