**Open** · multimeric opened this issue 3 years ago
This sounds fair. However, if `containerCall` is a static function, `self` will not be available.

If you are not planning to use `ContainerJob`, then all we really need to do is extract the little logic in `ContainerJob.call` and put it in a single function, for example `toil_container.containers.call`. This function could look like this:
```python
import subprocess

# Assumed imports: the exceptions module and the docker/singularity
# helpers that ContainerJob.call already uses.
from toil_container import exceptions
from toil_container.containers import docker_call, singularity_call


def call(args, docker=None, singularity=None, working_dir=None,
         volumes=None, cwd=None, env=None, check_output=False,
         remove_tmp_dir=True):
    # args is the command to run (e.g. ["cowsay", "hello"]); cwd and env
    # are forwarded to the underlying call.
    call_kwargs = {"args": args, "cwd": cwd, "env": env}

    if singularity and docker:
        raise exceptions.UsageError("use docker or singularity, not both.")

    if singularity or docker:
        call_kwargs["check_output"] = check_output

        # used for testing only; now an explicit argument, since a
        # standalone function has no self to read _rm_tmp_dir from
        call_kwargs["remove_tmp_dir"] = remove_tmp_dir

        if working_dir:
            call_kwargs["working_dir"] = working_dir

        if volumes:
            call_kwargs["volumes"] = volumes

        if singularity:
            call_kwargs["image"] = singularity
            call_function = singularity_call
        else:
            call_kwargs["image"] = docker
            call_function = docker_call
    elif check_output:
        call_function = subprocess.check_output
    else:
        call_function = subprocess.check_call

    errors = (exceptions.ContainerError, subprocess.CalledProcessError, OSError)

    try:
        output = call_function(**call_kwargs)
    except errors as error:  # pylint: disable=catching-non-exception
        raise exceptions.SystemCallError(error)

    try:
        return output.decode()
    except AttributeError:
        return output
```
Then `ContainerJob` could just call that function, and you can import `containers.call` wherever you like, passing the arguments you want.

Am I understanding this correctly?
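For intuition, here is a self-contained sketch of just the non-container path of such a function (the docker/singularity branches are omitted because they need `toil_container`'s helpers; this is an illustration, not the real implementation):

```python
import subprocess


def call(args, check_output=False):
    # Subprocess-only sketch of the proposed standalone function:
    # dispatch to check_output or check_call, then decode bytes output
    # if there is any (check_call returns an int, which has no .decode).
    call_function = subprocess.check_output if check_output else subprocess.check_call
    output = call_function(args)
    try:
        return output.decode()
    except AttributeError:
        return output


print(call(["echo", "hello"], check_output=True))  # "hello\n"
```

With `check_output=True` the decoded stdout is returned; otherwise the return code (0 on success) comes back unchanged.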
Yes, that's basically exactly what I want to do (and have done, in my WIP code)! I think that would be a big improvement.

The only issue I've encountered so far is in solving point b). Toil doesn't seem to have a mechanism for adding arbitrary data to the config and accessing it in jobs (see https://github.com/DataBiosphere/toil/issues/3619). So something simple like "use singularity for container jobs" has to be passed to each job, which is a pain. The closest I've got is accessing `fileStore.jobStore.config` from a job's `run` function, but sadly this returns the job store config, which is only a subset of the total Toil config.
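To illustrate the pain point with a minimal stand-in (not real Toil classes): without a shared config, the container choice has to be threaded through every job constructor by hand.

```python
class FakeJob:
    """Stand-in for a Toil job; real code would subclass toil.job.Job."""

    def __init__(self, options):
        self.options = options
        self.children = []

    def addChild(self, job):
        self.children.append(job)
        return job


options = {"singularity": "docker://ubuntu"}
root = FakeJob(options)                    # options passed here...
child = root.addChild(FakeJob(options))    # ...and again for every child job
print(child.options["singularity"])        # docker://ubuntu
```

Every job in the graph repeats the same constructor argument, which is exactly the verbosity point b) complains about.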
I think I understand what you are trying to do.

The `--docker` and `--singularity` options come from `toil_container` and have nothing to do with Toil; see https://github.com/papaemmelab/toil_container/blob/2afa80080fc83a34491cd80efe05159ddc386729/toil_container/parsers.py#L148-L228, where the Toil options are added in a parent class.

One reason you might not be seeing `--singularity` and `--docker` in `self.jobStore.config` is that you are not using `toil_container.parsers.ContainerArgumentParser`. Can you confirm you are using it? Here is an example of how to use it (you can completely ignore the `ContainerJob` stuff):
```python
# whalesay.py
from toil_container import ContainerJob
from toil_container import ContainerArgumentParser


class WhaleSayJob(ContainerJob):

    def run(self, fileStore):
        """Run `cowsay` with Docker, Singularity or Subprocess."""
        msg = self.call(["cowsay", self.options.msg], check_output=True)
        fileStore.logToMaster(msg)


def main():
    parser = ContainerArgumentParser()
    parser.add_argument("-m", "--msg", default="Hello from the ocean!")
    options = parser.parse_args()
    job = WhaleSayJob(options=options)
    ContainerJob.Runner.startToil(job, options)


if __name__ == "__main__":
    main()
```
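For intuition about what the parser does, the container flags behave like ordinary `argparse` options added on top of Toil's own. A rough stand-in using plain `argparse` (the real `ContainerArgumentParser` subclasses Toil's parser, so this is only an approximation):

```python
import argparse

# Approximation of ContainerArgumentParser: it adds --docker and
# --singularity alongside whatever arguments you register yourself.
parser = argparse.ArgumentParser()
parser.add_argument("--docker", default=None)
parser.add_argument("--singularity", default=None)
parser.add_argument("-m", "--msg", default="Hello from the ocean!")

options = parser.parse_args(["--singularity", "docker://ubuntu"])
print(options.singularity)  # docker://ubuntu
print(options.msg)          # Hello from the ocean!
```

The parsed `options` namespace is what gets handed to the job at construction time, which is why the flags are visible through `self.options` but not necessarily through Toil's own config.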
If, after checking this, you confirm that `--docker` and `--singularity` can be retrieved from `self.jobStore.config`, we might also be able to remove the requirement of the `options` attribute in `ContainerJob`. This would make our pipelines DRYer, so I'm looking forward to your report.
Update: I just ran a quick test on this. There is no `self.jobStore`; `jobStore` is an attribute of the `fileStore` argument passed to the `run` method. I checked, and it seems that only Toil arguments make it to `fileStore.jobStore.config`; any additional argument added to the `argparse` object doesn't make it to `config`.
Yes, I noted that exact result above:

> The closest I've got is to accessing fileStore.jobStore.config from a job's run function, but sadly this returns the job store config which is only a subset of the total toil config.
I'll just have to hope I get an answer back from the toil folk.
toil_container version: 1.1.6
Python version: 3.9.4
Operating System: Ubuntu 20.10
Description

Currently the most useful entry point to `toil_container` is the `call()` function of `ContainerJob`. Unfortunately, it uses the `options` dictionary, which is defined at initialization time. This means that:

a) All options have to be known before the job actually runs. This is not always true, as bind mounts might be dynamic, and even the docker/singularity choice could be changed dynamically.
b) We have to actually pass `options` every time we instantiate a job. This is verbose and not DRY.

New feature

I propose that we:

1. Pull `ContainerJob.call` out into a static function such as `containerCall()`.
2. `containerCall()` then takes the "options" (docker/singularity, mounts, etc.) as an argument, and does not rely on `self.options`. This resolves my first point.
3. `ContainerJob` then uses `containerCall()` internally for backwards compatibility (I have no intention to use `ContainerJob` myself, though).
4. Have `containerCall()` pull its default options directly from the Toil config object, accessed through `self.jobStore.config` from any given job. These can still be overridden by arguments. This system means that, if you have `--singularity` enabled, all jobs will automatically use Singularity (by default) without having to pass `options` around (my second point above).
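The proposed layering could be sketched like this. Names and the subprocess-only dispatch are illustrative; real code would also handle the docker/singularity branches and read defaults from the Toil config rather than a plain dict:

```python
import subprocess


def containerCall(args, docker=None, singularity=None, check_output=False):
    # Standalone entry point (sketch): only the subprocess branch is
    # shown; a real version would dispatch to docker/singularity helpers.
    if check_output:
        return subprocess.check_output(args).decode()
    return subprocess.check_call(args)


class ContainerJob:
    """Backwards-compatible wrapper (stand-in for the real class)."""

    def __init__(self, options=None):
        self.options = options or {}

    def call(self, args, **kwargs):
        # Per-call kwargs override defaults taken from self.options,
        # so existing code keeps working while new code can bypass it.
        kwargs.setdefault("docker", self.options.get("docker"))
        kwargs.setdefault("singularity", self.options.get("singularity"))
        return containerCall(args, **kwargs)


job = ContainerJob(options={"singularity": None})
print(job.call(["echo", "ok"], check_output=True))  # "ok\n"
```

This keeps `ContainerJob.call` as a thin shim over `containerCall()`, which matches points 1-3 above: callers that want the old behaviour keep it, while new code can import and call `containerCall()` directly with explicit arguments.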