papaemmelab / toil_container

:whale: Toil + Docker and Singularity.
MIT License

Refactor the way options are provided #28

Open multimeric opened 3 years ago

multimeric commented 3 years ago

- toil_container version: 1.1.6
- Python version: 3.9.4
- Operating System: Ubuntu 20.10

Description

Currently the most useful entry point to toil_container is the call() method of ContainerJob. Unfortunately, it reads its settings from the options object, which is set at initialization time. This means that:

a) All options have to be known before the job actually runs. This is not always possible: bind mounts might be dynamic, and even the Docker/Singularity choice could change at runtime.

b) We have to pass options every time we instantiate a job, which is verbose and not DRY (see the sketch below).
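For illustration, a minimal sketch of the pattern in (b); the pipeline job names here are made up, only the repeated options= argument is the point:

from toil_container import ContainerArgumentParser, ContainerJob


class AlignJob(ContainerJob):          # hypothetical example job
    def run(self, fileStore):
        self.call(["echo", "align"])


class CallVariantsJob(ContainerJob):   # hypothetical example job
    def run(self, fileStore):
        self.call(["echo", "variants"])


def main():
    options = ContainerArgumentParser().parse_args()

    # every job must be handed the parsed options at construction time,
    # even though it only needs them inside run()
    align = AlignJob(options=options)
    variants = CallVariantsJob(options=options)
    align.addChild(variants)

    ContainerJob.Runner.startToil(align, options)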

New feature

I propose that we:

jsmedmar commented 3 years ago

This sounds fair. However, if containerCall is a static function, self will not be available.

If you are not planning to use ContainerJob, then all we really need to do is extract the small amount of logic in ContainerJob.call and put it in a single function, for example toil_container.containers.call. This function could look like this:

import subprocess

from toil_container import exceptions
from toil_container.containers import docker_call, singularity_call


def call(
    args,
    docker=None,
    singularity=None,
    working_dir=None,
    volumes=None,
    cwd=None,
    env=None,
    check_output=False,
    remove_tmp_dir=True,
):
    """Run `args` with Docker, Singularity, or a plain subprocess."""
    # the command, working directory and environment are forwarded to every backend
    call_kwargs = {"args": args, "cwd": cwd, "env": env}

    if singularity and docker:
        raise exceptions.UsageError("use docker or singularity, not both.")

    if singularity or docker:
        call_kwargs["check_output"] = check_output

        # used for testing only; taken as an argument because self is not
        # available in a standalone function
        call_kwargs["remove_tmp_dir"] = remove_tmp_dir

        if working_dir:
            call_kwargs["working_dir"] = working_dir

        if volumes:
            call_kwargs["volumes"] = volumes

        if singularity:
            call_kwargs["image"] = singularity
            call_function = singularity_call
        else:
            call_kwargs["image"] = docker
            call_function = docker_call

    elif check_output:
        call_function = subprocess.check_output

    else:
        call_function = subprocess.check_call

    errors = (exceptions.ContainerError, subprocess.CalledProcessError, OSError)

    try:
        output = call_function(**call_kwargs)
    except errors as error:  # pylint: disable=catching-non-exception
        raise exceptions.SystemCallError(error)

    try:
        return output.decode()
    except AttributeError:
        return output

Then ContainerJob could just call that function, and you can import toil_container.containers.call wherever you like, passing whichever arguments you want.
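As a rough sketch only (assuming the wrapper keeps the current options-based lookups; attribute names like workDir and volumes stand in for whatever ContainerJob uses today):

from toil.job import Job

from toil_container import containers  # module where the standalone call() would live


class ContainerJob(Job):
    """Sketch: delegate to containers.call instead of duplicating the logic."""

    def __init__(self, options, *args, **kwargs):
        self.options = options
        super().__init__(*args, **kwargs)

    def call(self, args, cwd=None, env=None, check_output=False):
        # the container choice and mounts still come from the parsed options
        return containers.call(
            args,
            docker=getattr(self.options, "docker", None),
            singularity=getattr(self.options, "singularity", None),
            working_dir=getattr(self.options, "workDir", None),
            volumes=getattr(self.options, "volumes", None),
            cwd=cwd,
            env=env,
            check_output=check_output,
        )


# and from anywhere else, without a job:
# output = containers.call(["cowsay", "hello"], check_output=True)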

Am I understanding this correctly?

multimeric commented 3 years ago

Yes that's basically exactly what I want to do (and have done, in my WIP code)! I think that would be a big improvement.

The only issue I've encountered so far is in solving point b). Toil doesn't seem to have a mechanism for adding arbitrary data to the config and accessing it in jobs (see https://github.com/DataBiosphere/toil/issues/3619). So something simple like "use singularity for container jobs" has to be passed to each job, which is a pain. The closest I've got is accessing fileStore.jobStore.config from a job's run function, but sadly this returns the job store config, which is only a subset of the total Toil config.
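Concretely, the best I've managed is along these lines (the job class is made up; only the config access pattern is the point):

from toil_container import ContainerJob


class ConfigProbeJob(ContainerJob):
    """Made-up job that just logs what reaches the job store config."""

    def run(self, fileStore):
        config = fileStore.jobStore.config
        # Toil's own settings show up here, but extra argparse options
        # (e.g. --singularity) don't seem to make it in
        fileStore.logToMaster(str(vars(config)))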

jsmedmar commented 3 years ago

I think I understand what you are trying to do.

The --docker and --singularity options come from toil_container and have nothing to do with Toil itself; see: https://github.com/papaemmelab/toil_container/blob/2afa80080fc83a34491cd80efe05159ddc386729/toil_container/parsers.py#L148-L228

The Toil options, in contrast, are added in a parent class:

https://github.com/papaemmelab/toil_container/blob/2afa80080fc83a34491cd80efe05159ddc386729/toil_container/parsers.py#L59-L61

One reason you might not be seeing --singularity and --docker in self.jobStore.config is that you are not using toil_container.parsers.ContainerArgumentParser. Can you confirm that you are? Here is an example of how to use it (you can completely ignore the ContainerJob stuff):

# whalesay.py
from toil_container import ContainerJob
from toil_container import ContainerArgumentParser

class WhaleSayJob(ContainerJob):

    def run(self, fileStore):
        """Run `cowsay` with Docker, Singularity or Subprocess."""
        msg = self.call(["cowsay", self.options.msg], check_output=True)
        fileStore.logToMaster(msg)

def main():
    parser = ContainerArgumentParser()
    parser.add_argument("-m", "--msg", default="Hello from the ocean!")
    options = parser.parse_args()
    job = WhaleSayJob(options=options)
    ContainerJob.Runner.startToil(job, options)

if __name__ == "__main__":
    main()

If, after checking this, you confirm that --docker and --singularity can be retrieved from self.jobStore.config, we might also be able to remove the requirement of the options attribute in ContainerJob. This would make our pipelines DRYer, so I'm looking forward to your report.

jsmedmar commented 3 years ago

Update: I just ran a quick test on this. There is no self.jobStore; jobStore is an attribute of the fileStore argument passed to the run method. I checked, and it seems that only Toil arguments make it to fileStore.jobStore.config; any additional argument added to the argparse object doesn't make it into the config.

multimeric commented 3 years ago

Yes I noted that exact result above:

The closest I've got is accessing fileStore.jobStore.config from a job's run function, but sadly this returns the job store config, which is only a subset of the total Toil config.

I'll just have to hope I get an answer back from the Toil folks.