Open jgallowa07 opened 2 months ago
Thanks Jared! Following Snakemake best practice (see here), I would recommend partitioning the current global env into several envs—one for each rule. This would likely eliminate many dependency conflicts.
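For instance, something along these lines (a rough sketch only; the rule, script, path, and env-file names here are placeholders rather than what the repo actually uses), with the envs built automatically when snakemake is invoked with --use-conda:

```python
# Sketch of per-rule conda envs in the Snakefile; all names are illustrative.
rule process_ccs:
    input:
        "data/ccs/ccs_reads.fastq.gz"
    output:
        "results/process_ccs/processed_ccs.csv"
    conda:
        "envs/process_ccs.yml"      # small env with only the CCS-processing deps
    shell:
        "python scripts/process_ccs.py {input} {output}"

rule count_variants:
    input:
        "results/process_ccs/processed_ccs.csv"
    output:
        "results/summary/variant_counts.csv"
    conda:
        "envs/count_variants.yml"   # separate small env with only the counting deps
    shell:
        "python scripts/count_variants.py {input} {output}"
```

Each env file then only pins the handful of packages its rule actually needs, which should make the solves faster and any remaining conflicts much easier to localize.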
This pipeline has a few more issues to solve before I would consider it reproducible for publication. Digging into the issues outlined above, here is what needs to be addressed before I feel comfortable with the results from this pipeline.
It seems the pipeline (or possibly I, by mistake) somehow deleted the output files from count_variants.ipynb: variant_counts.csv and count_variants.md. Not sure how that happened, but when I reset those files, the pipeline does indeed skip this step and seems to proceed without error.
So this is a slightly more complex problem than I initially thought. The overarching problem here is that this repository does not include the files necessary to run the upstream rules (i.e., process_ccs and count_variants). Upon initial execution, snakemake tries to execute count_variants (even though the outputs are available) due to "modified input" (when the input is actually missing). This results in snakemake deleting the two files I mentioned above. The only way I've found to fix this is by letting it fail and resetting these files:
git checkout HEAD -- results/summary/count_variants.md
git checkout HEAD -- results/summary/variant_counts.csv
I'm not exactly sure where this behavior comes from, or how to fix it -- especially given that this is a quite old version of snakemake. How big, exactly, are the NGS files? Could they be compressed and tracked via git lfs such that this repository actually contains all the input data necessary? @tylernstarr
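In the meantime, one possible (untested) mitigation for the spurious rerun might be to mark the missing upstream input with snakemake's ancient() flag, so its timestamp is ignored when the tracked outputs already exist, or to run snakemake --touch once to mark those outputs as up to date. A hypothetical sketch (the input path and the notebook-execution command are placeholders, and I'm not sure how the old snakemake version here behaves):

```python
rule count_variants:
    input:
        # ancient() makes snakemake ignore this file's modification time, so the
        # rule should not be re-triggered (and its outputs deleted) when
        # variant_counts.csv and count_variants.md already exist
        ccs=ancient("results/process_ccs/processed_ccs.csv"),
    output:
        "results/summary/variant_counts.csv",
        "results/summary/count_variants.md",
    # stand-in for however the pipeline actually executes the notebook
    shell:
        "jupyter nbconvert --to notebook --execute count_variants.ipynb"
```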
It's not necessarily the most reproducible thing to say that the environment you need for half the pipeline only exists as a module on a single institution's cluster. We need to be able to reproduce this environment in a portable manner. Over on the improved_Kd_fitting branch, I've done my best to create smaller environments for the pipeline, which of course works great for Python, but is an absolute pain for the R side. I'm continuing to try to build an environment that works, but I'm so unfamiliar with R that things are frustratingly trial-and-error at the moment. I'm almost tempted to rewrite the R code in Python. That would certainly make @wsdewitt's approach easier to implement in a cohesive manner with the rest of the analysis (e.g. expression/PSR).
Thanks, Jared! The readme notes
Note the sequencing data is not included in this repository and thus the processing of the ccs and the variants counts step are not able to be run from the data
so can we configure snakemake to raise an error if the data paths are not supplied (in a config file or on the command line)? Additionally, we could add a flag (call it "raw") to run the pipeline starting from the raw sequencing files, whereas without this flag, the pipeline takes count_variants.md and variant_counts.csv as input. I believe this could be done with conditional statements for these files (as either rule outputs or paths to existing files). If the raw data is huge, I think it's normal not to include it in the repo (but it should be made available somewhere).
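Roughly something like this, say near the top of the Snakefile (a sketch only; the config keys, targets, and rule-file names are made up for illustration, not what the repo currently has):

```python
configfile: "config.yaml"

RAW = config.get("raw", False)

# fail fast with a clear message if raw mode is requested without the data paths
if RAW and not config.get("ngs_dir"):
    raise ValueError("raw=True requires config['ngs_dir'] to point at the raw sequencing data")

rule all:
    input:
        "results/summary/count_variants.md",
        "results/summary/variant_counts.csv",

if RAW:
    # only define the upstream rules when the raw data is actually available, so
    # by default the tracked count files are treated as plain pre-existing inputs
    include: "rules/process_ccs.smk"
    include: "rules/count_variants.smk"
```

With that, the default invocation just consumes the count files already tracked in the repo, and setting raw: true (plus the data path) in the config rebuilds them from the sequencing files.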
You mentioned the snakemake version is very old—should we upgrade to the latest?
Re R environment, should we specify a conda env with R (https://docs.anaconda.com/working-with-conda/packages/using-r-language/)?
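For example (just a sketch; the env-file and script names are hypothetical), a rule running R code could get its own conda env, with r-base and the needed packages coming from conda-forge, attached via the conda: directive:

```python
rule fit_titrations:
    input:
        "results/summary/variant_counts.csv"
    output:
        "results/titrations/titration_fits.csv"   # placeholder output
    conda:
        "envs/r_env.yml"   # would list e.g. r-base, r-tidyverse, r-rmarkdown from conda-forge
    script:
        "scripts/fit_titrations.R"   # snakemake exposes input/output paths to the R script
```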
Thanks, Jared! The readme notes
Yes - I added this to the README on our topic branch.
so can we configure snakemake to raise an error if the data paths are not supplied (in a config file or on the command line)? Additionally, we could add a flag (call it "raw") to run the pipeline starting from the raw sequencing files, whereas without this flag, the pipeline takes count_variants.md and variant_counts.csv as input.
Yea this seems reasonable to me. Happy to do so.
You mentioned the snakemake version is very old—should we upgrade to the latest?
Sure - happy to do so! Hopefully the syntax and API are backwards compatible, but I think snakemake generally tries hard to make that the case.
Re R environment, should we specify a conda env with R (https://docs.anaconda.com/working-with-conda/packages/using-r-language/)?
I've already refactored the pipeline to use environments defined here. But currently I'm just spending a lot of time debugging the R environment. For example, the latest problem I've run into is:
Quitting from lines 15-36 (compute_binding_Kd.Rmd)
Error in contrib.url(repos, type) :
trying to use CRAN without setting a mirror
Calls: <Anonymous> ... eval -> install.packages -> startsWith -> contrib.url
Execution halted
Note that this specific problem arises in the old Kd fitting steps. I could go ahead and hack that out of the pipeline, but then I'd have to modify all the downstream R code. I will keep trying to make the R code work with conda.
Once it does, I'll likely feel most comfortable if we containerize things like this.
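One option (just a sketch; the image name below is a placeholder, not an image that exists yet) would be snakemake's global container directive, so that every rule runs inside the same image when the workflow is invoked with --use-singularity:

```python
# near the top of the Snakefile: run all rules inside one container image
container: "docker://example-org/rbd-pipeline-env:latest"   # hypothetical, purpose-built image
```

I believe this can also be combined with --use-conda, so the per-rule conda envs get created inside the container rather than on the host.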
I do think we want to remove compute_binding_Kd.Rmd for this PR, since it's superseded by your new rule.
Yea, that will require modification of the downstream R, obviously, but I agree. I think I've finally solved the environment stuff so hopefully the rest goes fairly quickly.
Overview
First, great pipeline, I like the way the key files and pages are automatically generated.
While running the pipeline to work in the changes described in improved_Kd_fitting, I've run into a few issues that may be worth patching, or at least documenting for future users who hit the same problems.
conda environment

I ran into issues that took me a while to figure out when attempting to create the environment defined in environment.yml. I'm a mamba user, and the command given in the README

conda env create -f environment.yml -p ./env

gave me package conflicts and hung unexpectedly. I was eventually able to get the environment built, but that required completely nuking my mamba install and re-installing miniconda. This is fine, but a problem arises because snakemake recommends using mamba, so when executing the pipeline the first thing one sees is an error. The simple fix is obviously adding --conda-frontend conda to run_Hutch_cluster.sh, but a more robust fix would involve solving the dependency conflicts so that the environment can indeed be built directly with mamba, with more exactly pinned dependencies, and built automatically by the snakemake run rather than by the user. I'm happy to take a swing at that if we think it would be helpful.

Running on the cluster using sbatch
I'm a little confused how the cluster execution is happening here, but I have less experience farming jobs from snakemake (and it seems the approach has drastically changed through the versions). Anyhow, it failed on the first attempt. The failed jobs gave log information in the slurm ".out" files, which revealed the underlying cause of the failure above; it is related to the same environment issue described earlier. So it seems that adding the --conda-frontend conda argument to the command in run_Hutch_cluster.sh would solve things ... but when I add that I seem to get the same error? This could potentially be solved by choosing an earlier version of snakemake that defaults to the conda frontend instead of mamba, or by specifying in the README that the user should configure snakemake inside the environment to use that frontend. I have not tried either of these yet, as things seem to run just fine without batch submission for the time being.

missing input data
The README states that this repository contains all the necessary input data to run the pipeline, yet the first rule run of the pipeline, count_variants.ipynb, gives me an error because it is searching for files I don't have access to ... It seems that this step need not be run given that the output files are stored in the results, but nonetheless this is what it's giving me. I don't think you need to store all the NGS ... but maybe specify at which intermediate step the user is able to run the pipeline without needing to download from SRA or the like?
SOLVED: it seems the pipeline (or possibly I, by mistake) somehow deleted the output files from count_variants.ipynb: variant_counts.csv and count_variants.md. Not sure how that happened, but when I reset those files, the pipeline does indeed skip this step and seems to proceed without error.