Closed kysrpex closed 10 months ago
Do you think we should invest more time in a better solution and/or something that is really upstreamable? Or is this ok to complete the migration?
@sanjaysrikakulam This is the same patch to the Galaxy code we have just tested.
Wouldn't it be better to have an override per command?
e.g. `condor_rm_cmd` and `condor_queue_cmd` ...
@bgruening So the answer to "Do you think we should invest more time in a better solution and/or something that is really upstreamable?" is yes?
Sorry I wanted to paste this link above: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/job_conf.xml.sample_advanced#L579
I don't know, please discuss this internally. I assume using `condor_rm_cmd` makes the patch here and your Ansible easier and cleaner.
The last commit adds the three optional parameters `condor_rm_cmd`, `condor_ssh_to_job_cmd` and `condor_submit_cmd` that you requested (the runner does not call any other condor commands).
In addition and more importantly, it also makes the runner invoke everything via shell, so

"The shortcoming of this approach is that the contents of `prefix` are constrained to strings that, when prepended to the `condor_*` commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the `subprocess.Popen` constructor (using `shell=False`)."

no longer applies and one can do tricks like this (this is what I did to test it),
    runners:
      condor:
        load: galaxy.jobs.runners.condor:CondorJobRunner
      condor_secondary:
        load: galaxy.jobs.runners.condor:CondorJobRunner
        prefix: "touch /tmp/prefix_works; "
        condor_rm_cmd: "touch /tmp/condor_rm_cmd_works; condor_rm"
        condor_ssh_to_job_cmd: "touch /tmp/condor_ssh_to_job_cmd_works; condor_ssh_to_job"
        condor_submit_cmd: "touch /tmp/condor_submit_cmd_works; condor_submit"
which definitely makes the Ansible playbooks cleaner, because then `systemd-run` can be set as `prefix`.
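To illustrate the difference between the two invocation modes, here is a minimal sketch, assuming a POSIX shell is available. `echo` stands in for the `condor_*` binaries, and `build_command` and `run` are hypothetical helpers written for this example, not the runner's actual code. Splitting the string with `shlex.split` in the `shell=False` case is also an assumption made for illustration.

```python
import shlex
import subprocess


def build_command(prefix: str, command: str, *args: str) -> str:
    """Concatenate prefix and command, then append shell-quoted arguments."""
    quoted_args = " ".join(shlex.quote(arg) for arg in args)
    return f"{prefix}{command} {quoted_args}"


def run(command_line: str, shell: bool) -> str:
    """Run the command line and return its stdout.

    With shell=False the string is split into an argument vector first,
    so shell syntax such as ';' is passed along as a literal argument.
    """
    argv = command_line if shell else shlex.split(command_line)
    result = subprocess.run(
        argv, shell=shell, capture_output=True, text=True, check=True
    )
    return result.stdout


# The prefix contains a shell command followed by ';', as in the test config.
line = build_command("echo prefix_ran; ", "echo", "123.0")

via_shell = run(line, shell=True)   # the prefix runs as its own command
no_shell = run(line, shell=False)   # the prefix is just part of one argv
```

With `shell=True` the prefix executes as a separate command before the stand-in binary. With `shell=False` the whole line becomes a single `echo` call, which is why the prefix previously had to resolve to an executable.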
Additionally, while testing this I found out that there is a bug in Galaxy that prevents it from stopping and killing Docker containers. Thus, whenever a user clicks the trash icon to remove a running Docker job, the container is not stopped. The error manifests in the logs as follows,
Nov 03 10:**:** sn06.galaxyproject.eu python[2314207]: galaxy.jobs.runners.condor WARNING 2023-11-03 10:**:**,*** [pN:handler_sn06_*,p:2314207,tN:JobHandlerStopQueue.monitor_thread] stop_job(): ********: trying to kill container failed. ('commands')
and this is the line of code that triggers it. The reason is a `KeyError` (`"commands"` key). `cont.container_info` is an instance of `galaxy.model.custom_types.MutationDict`, but printing `cont.container_info` shows `{}` (an empty dictionary).
I will report the issue on the Galaxy issue tracker.
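A minimal sketch of how an empty `container_info` produces that log message, assuming the object behaves like a plain `dict` for lookups. The `"kill"` key and the `lookup_container_command` helper are hypothetical names used only for illustration. The guarded lookup is one possible way to avoid the exception, not the fix Galaxy adopted.

```python
container_info = {}  # stand-in for the empty dictionary printed above

# Direct indexing fails on the missing "commands" key.
try:
    container_info["commands"]["kill"]
    message = None
except KeyError as error:
    message = f"trying to kill container failed. ({error})"


def lookup_container_command(info, name):
    """Return the stored command, or None when it was never recorded."""
    return info.get("commands", {}).get(name)


missing = lookup_container_command(container_info, "kill")
present = lookup_container_command({"commands": {"kill": "docker kill abc"}}, "kill")
```

The text in `message` matches the tail of the warning in the log, which is consistent with the `KeyError` explanation.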
Have you read https://docs.python.org/3.7/library/subprocess.html#security-considerations and why we should not use shell=True? Are you sure you can not get the same results with shell=False?
That's a very good point. Let's look at the possible ways to take advantage of shell injection here:
1. `{prefix}{command} {submit_file}`. We configure the prefix and the command. Can `submit_file` be malicious? That's a string containing a path, crafted by Galaxy.
2. `{prefix}{command} {external_id}`. `external_id` is generated by HTCondor.
3. `f"{self.runner_params.get('prefix', '')}{self.runner_params.get('condor_ssh_to_job_cmd')} {external_job_id} {command}"`. Again, `external_job_id` is generated by HTCondor. `command` comes from `job_wrapper.get_job().container.container_info["commands"][command]`. `job_wrapper.get_job()` yields a `model.Job`, so I guess `container` is what is stored on the job container association table. Looking at `psql`, that seems to be a JSON blob. I have no idea what could be in there. I would have to spend a few hours diving into the Galaxy code to find out with certainty, or alternatively (although less reliably), read examples from the database.

I think the conclusion is: if we trust Galaxy developers and HTCondor developers, then the first two points are ok. If we trust the tools we install, then the third point is probably also ok. We are not dealing with untrusted inputs here.
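For reference, a minimal sketch of what shell injection through an interpolated value would look like, and how `shlex.quote` neutralises it, assuming a POSIX shell. `echo` stands in for the condor binaries, and `hostile_id` is a made-up value, not something HTCondor is known to produce. Whether quoting suits the third call depends on how the stored command is meant to be split, which is not verified here.

```python
import shlex
import subprocess


def run_in_shell(command_line: str) -> str:
    """Run a command line through the shell and return its stdout."""
    result = subprocess.run(
        command_line, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical identifier carrying an extra shell command.
hostile_id = "123.0; echo injected"

# Interpolated as-is, the shell runs the trailing command as well.
unquoted = run_in_shell(f"echo {hostile_id}")

# Quoted, the whole value reaches the program as one literal argument.
quoted = run_in_shell(f"echo {shlex.quote(hostile_id)}")
```

In the unquoted case the second `echo` runs on its own. In the quoted case the `;` is printed as ordinary text.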
If there are no further objections, let's move on?
Go for it ... but please try to get those changes or similar ones also upstream for the next release.
This PR is meant to be used for the HTCondor migration. It defines an optional parameter `prefix` for the HTCondor job runner that prepends the parameter value to all calls to HTCondor binaries, such as `condor_submit`, `condor_rm` or `condor_ssh_to_job`.

This enables having two HTCondor runners defined in the Galaxy job configuration file, with each one calling different executable files, which allows routing jobs to two different HTCondor clusters.

The shortcoming of this approach is that the contents of `prefix` are constrained to strings that, when prepended to the `condor_*` commands, still point to an executable (i.e. no shell commands). Thus, such executable files need to be created and placed somewhere. A minimal example is available here. This constraint exists because of how Galaxy calls the `subprocess.Popen` constructor (using `shell=False`).