Since we want to run PHANTA on both the masked and unmasked databases, as well as the post-processing scripts, I have incorporated it into the kraken module. There are a few little quirks to note:
## Snakemake within snakemake
The two main problems I was trying to solve were how to obtain PHANTA and how to ensure it has the dependencies it needs.
For getting PHANTA, the three options I saw were a git submodule, a Snakemake remote, or bundling it into a Docker container. Since we already containerize our rule dependencies, I went with Docker.
To avoid dealing with conda environments, I built the Docker container from the dependencies conveniently described by the conda env in the PHANTA repo.
However, when I tried running this, jobs spawned from this rule were not being run from the same container. To address this, I added a self-referential `containerized` directive to the main Snakefile; see https://github.com/vdblab/dockerfiles/blob/967ed8d70d328a726c573c92bb50b3ae33a8c926/phanta/Dockerfile#L11. I tried to avoid this with the `handover` directive and with the `--container` CLI arg, but neither worked.
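For context, the directive amounts to a single line at the top of the Snakefile; a minimal sketch (the image URI here is a placeholder, not our actual tag):

```python
# Sketch: pin the whole workflow to the image the phanta rules use, so
# jobs spawned by the nested snakemake inherit identical dependencies.
# The image URI below is a placeholder.
containerized: "docker://ghcr.io/vdblab/phanta:latest"
```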
I ended up having this rule run as a "local" job (that is, local to the node the PHANTA rules are distributed to, as opposed to those nodes then acting as head nodes submitting rules to other nodes); see the sketch below. The main issue was conflicting mount points when running a containerized workflow from a containerized environment. Because of this, the `containerized` directive above may not be necessary.
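To illustrate what "local" means here, a rough sketch of the rule (the rule name, paths, and target layout are illustrative, not the actual module code): the inner snakemake is invoked without a profile, so all of PHANTA's jobs stay on the node the rule was dispatched to.

```python
# Sketch (names and paths hypothetical): the nested snakemake gets
# --cores but no --profile, so none of its jobs are resubmitted to the
# cluster; everything runs on the node this rule landed on.
rule run_phanta:
    input:
        cfg="phanta_config.yaml",
    output:
        directory("phanta/{db}/final_merged_outputs"),
    threads: 16
    shell:
        "snakemake "
        "--snakefile /phanta/Snakefile "
        "--configfile {input.cfg} "
        "--cores {threads}"
```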
## This PR
I've added 8 relevant arguments from PHANTA to our config, as well as a `skip_phanta` option we can enable for testing (a sketch of the gating is below). The workflow runs PHANTA twice: once for the masked database and once for the unmasked one. Two of the downstream scripts have been added as well, to predict lifestyle and host specificity. I did not add the correlation analysis script, as that involves running kraken separately for each read direction, which adds runtime/space overhead; we could decide to do that in the future.
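For reference, the gating ends up looking roughly like this (only `skip_phanta` is the real config key; the target paths are hypothetical):

```python
# Sketch: collect phanta targets only when skip_phanta is unset/false,
# requesting one run per database (masked and unmasked).
phanta_targets = []
if not config.get("skip_phanta", False):
    phanta_targets = expand(
        "phanta/{db}/final_merged_outputs/counts.txt",
        db=["masked", "unmasked"],
    )

rule all:
    input:
        phanta_targets,
```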
I also changed the Snakemake profile variable from SNAKEPROFILE to SNAKEMAKE_PROFILE, which Snakemake will look for automatically!