usegalaxy-eu / galaxy

Data intensive science for everyone.
https://galaxyproject.org/
Other
2 stars 11 forks source link

Add support for prefixing calls to HTCondor binaries to the HTCondor runner #202

Closed kysrpex closed 10 months ago

kysrpex commented 10 months ago

This PR is meant to be used for the HTCondor migration. It defines an optional parameter prefix for the HTCondor job runner that prepends the parameter value to all calls to HTCondor binaries, such as condor_submit, condor_rm or condor_ssh_to_job.

This enables having two HTCondor runners defined in the Galaxy job configuration file, with each one calling different executable files and thus allowing to route jobs to two different HTCondor clusters.

The shortcoming of this approach is that the contents of prefix are constrained to strings that, when prepended to the condor_* commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the subprocess.Popen constructor (using shell=False).

How to test the changes?

(Select all options that apply)

License

kysrpex commented 10 months ago

Do you think we should invest more time in a better solution and/or something that is really upstreamable? Or is this ok to complete the migration?

kysrpex commented 10 months ago

@sanjaysrikakulam This is the same patch to the Galaxy code we have just tested.

bgruening commented 10 months ago

Wouldn't it be better to have a overwrite per command?

docker_docker_cdocker_odocker_cdocker_c

e.g. condor_rm_cmd and condor_queue_cmd ...

kysrpex commented 10 months ago

Wouldn't it be better to have a overwrite per command?

docker_docker_cdocker_odocker_cdocker_c

e.g. condor_rm_cmd and condor_queue_cmd ...

@bgruening So the answer to

Do you think we should invest more time in a better solution and/or something that is really upstreamable?

is yes?

bgruening commented 10 months ago

Sorry I wanted to paste this link above: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/job_conf.xml.sample_advanced#L579

bgruening commented 10 months ago

I don't know please discuss this internally. I assume using condor_rm_cmd makes the patch here and your ansible easier and cleaner.

kysrpex commented 10 months ago

The last commit adds the three optional parameters condor_rm_cmd, condor_ssh_to_job_cmd, condor_submit_cmd that you requested (the runner does not call any other condor commands).

In addition and more importantly, it also makes the runner invoke everything via shell, so

The shortcoming of this approach is that the contents of prefix are constrained to strings that, when prepended to the condor_* commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the subprocess.Popen constructor (using shell=False).

no longer applies and one can do tricks like this (this is what I did to test it),

runners:
  condor:
    load: galaxy.jobs.runners.condor:CondorJobRunner
  condor_secondary:
    load: galaxy.jobs.runners.condor:CondorJobRunner
    prefix: "touch /tmp/prefix_works; "
    condor_rm_cmd: "touch /tmp/condor_rm_cmd_works; condor_rm"
    condor_ssh_to_job_cmd: "touch /tmp/condor_ssh_to_job_cmd_works; condor_ssh_to_job"
    condor_submit_cmd: "touch /tmp/condor_submit_cmd_works; condor_submit"

which definitely makes the Ansible playbooks cleaner, because then systemd-run can be set as prefix.

kysrpex commented 10 months ago

Additionally, while testing this I found out that there is a bug in Galaxy that prevents it from stopping and killing Docker containers. Thus, whenever a user clicks the trash icon to remove a running Docker job, the container is not stopped. The error manifests in the logs as follows,

Nov 03 10:**:** sn06.galaxyproject.eu python[2314207]: galaxy.jobs.runners.condor WARNING 2023-11-03 10:**:**,*** [pN:handler_sn06_*,p:2314207,tN:JobHandlerStopQueue.monitor_thread] stop_job(): ********: trying to kill container failed. ('commands')

and this is the line of code that triggers it. The reason is a KeyError ("commands" key). cont.container_info is an instance of galaxy.model.custom_types.MutationDict, but printing cont.container_info shows {} (an empty dictionary).

I will report the issue on the Galaxy issue tracker.

bgruening commented 10 months ago

Have you read https://docs.python.org/3.7/library/subprocess.html#security-considerations and why we should not use shell=True? Are you sure you can not get the same results with shell=False?

kysrpex commented 10 months ago

Have you read https://docs.python.org/3.7/library/subprocess.html#security-considerations and why we should not use shell=True? Are you sure you can not get the same results with shell=False?

That's a very good point. Let's look at what are the possible ways to take advantage of shell injection here:

I think the conclusion is, if we trust Galaxy developers and HTCondor developers, then the two first points are ok. If we trust the tools we install, then the third point is probably also ok. We are not dealing with untrusted inputs here.

kysrpex commented 10 months ago

If there are no further objections, let's move on?

bgruening commented 10 months ago

Go for it ... but please try to get those changes or similar ones also upstream for the next release.