jbloomlab / Ab-CGGnaive_DMS

deep mutational scanning of the CGGnaive antibody

Some issues running pipeline #14

Open jgallowa07 opened 2 months ago

jgallowa07 commented 2 months ago

Overview

First off, great pipeline - I like the way the key files and pages are automatically generated.

While running the pipeline to work in the changes described in improved_Kd_fitting, I've hit a few issues that may be worth either patching or at least documenting for future users who run into the same problems.

conda environment

I ran into issues that took me a while to figure out when creating the environment defined in environment.yml. I'm a mamba user, and the README command conda env create -f environment.yml -p ./env gave me package conflicts and hung unexpectedly. I did eventually get the environment built, but only after completely nuking my mamba install and re-installing miniconda. That's workable, but it creates a problem because snakemake recommends using mamba, so the first thing one sees when executing the pipeline is:

Error: mamba package manager is not available. The mamba package manager (https://github.com/mamba-org/mamba) is an extremely fast and robust conda replacement. It is the recommended way of using Snakemake's conda integration. It can be installed with `conda install -n base -c conda-forge mamba`. If you still prefer to use conda, you can enforce that by setting `--conda-frontend conda`.

The simple fix is obviously adding --conda-frontend conda to run_Hutch_cluster.sh, but a more robust fix would be to resolve the dependency conflicts so the environment can indeed be built directly with mamba, with more exactly pinned dependencies, and built as a side effect of the snakemake run rather than by the user. I'm happy to take a swing at that if we think it would be helpful.

Running on the cluster using sbatch

I'm a little confused about how the cluster execution is happening here (I have less experience farming jobs out from snakemake, and it seems the approach has changed drastically across versions). Anyhow, it failed on the first attempt. The failed jobs wrote log information to "slurm-.out" files, which revealed the underlying cause of the failure: the same environment issue described above:

Error: mamba package manager is not available. 

So it seems that adding the --conda-frontend conda argument to the command in run_Hutch_cluster.sh should solve things ... but when I add it I seem to get the same error? This could potentially be solved by choosing an earlier version of snakemake that defaults to the conda frontend instead of mamba, or by specifying in the README that the user should configure snakemake inside the environment to use that frontend. I have not tried either of these yet, since the pipeline seems to run just fine without batch submission for the time being.

missing input data

The README states that this repository contains all the necessary input data to run the pipeline, yet in the first rule the pipeline runs, count_variants.ipynb, I get the error

OSError: no FASTQ file /shared/ngs/illumina/tstarr/210629_D00300_1271_BHM55FBCX3/Unaligned/Project_tstarr/CGGexpr_lib2_bin3-1_S91_R1_001.fastq.gz

because the pipeline is searching for files I don't have access to ...

(/fh/fast/matsen_e/jgallowa/Ab-CGGnaive_DMS/env) jgallowa@gizmok34:/fh/fast/matsen_e/jgallowa/Ab-CGGnaive_DMS$ ls /shared/ngs/illumina/tstarr/
ls: cannot open directory '/shared/ngs/illumina/tstarr/': Permission denied

It seems this step need not be run at all, given that its output files are already stored in results, but nonetheless this is what it gives me. I don't think you need to store all the NGS data ... but maybe specify the intermediate step from which the user can run the pipeline without needing to download from the SRA or the like?

SOLVED: it seems the pipeline (or possibly I, by mistake) somehow deleted the output files from count_variants.ipynb: variant_counts.csv and count_variants.md. I'm not sure how that happened, but when I reset those files, the pipeline does indeed skip this step and seems to proceed without error.

wsdewitt commented 2 months ago

Thanks Jared! Following Snakemake best practice (see here), I would recommend partitioning the current global env into several envs—one for each rule. This would likely eliminate many dependency conflicts.
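
For concreteness, here's a minimal sketch of the per-rule pattern (the rule, script, and env-file names are placeholders, not the repo's actual paths):

```
# hypothetical Snakefile fragment -- names are illustrative only
rule count_variants:
    input:
        "results/process_ccs/processed_ccs.csv"
    output:
        "results/summary/variant_counts.csv"
    conda:
        "envs/count_variants.yaml"  # small env with only this rule's deps
    shell:
        "python scripts/count_variants.py {input} {output}"
```

Invoked with `snakemake --use-conda`, Snakemake builds each of these small envs itself, which would also address your point about the env being created as a side effect of the run rather than by the user.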

jgallowa07 commented 2 months ago

This pipeline has some more issues to address before I would consider it reproducible for publication. Digging into the problems outlined above, here is what needs to be solved before I would feel comfortable with the results.

Missing Data

it seems the pipeline (or possibly I, by mistake) somehow deleted the output files from count_variants.ipynb: variant_counts.csv and count_variants.md. I'm not sure how that happened, but when I reset those files, the pipeline does indeed skip this step and seems to proceed without error.

So this is a slightly more complex problem than I initially thought. The overarching problem is that this repository does not include the files necessary to run the upstream rules (i.e. process_ccs and count_variants). Upon initial execution, snakemake tries to execute count_variants (even though its outputs are available) due to "modified input" (when the input is actually missing). This results in snakemake deleting the two files mentioned above. The only way I've found to fix this is to let it fail and then reset these files:

git checkout HEAD -- results/summary/count_variants.md
git checkout HEAD -- results/summary/variant_counts.csv

I'm not exactly sure where this behavior comes from or how to fix it -- especially given that this is quite an old version of snakemake. How big, exactly, are the NGS files? Could they be compressed and tracked via git lfs so that this repository actually contains all the necessary input data? @tylernstarr
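
One possible band-aid I've been considering (untested, and assuming the re-run really is being triggered by snakemake's timestamp comparison on these tracked files): flag the checked-in intermediates with ancient() so their modification times are ignored. Something like:

```
# untested sketch -- the rule name is a placeholder for whatever rule
# consumes the tracked intermediate; ancient() tells snakemake to ignore
# the file's mtime, so the consumer won't re-trigger on a fresh clone
rule fit_titrations:
    input:
        counts=ancient("results/summary/variant_counts.csv"),
```

But that only treats the symptom; the real fix is probably to make the upstream rules conditional on the raw data actually being present.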

R environment

It's not exactly reproducible to say that the environment needed for half the pipeline only exists as a module on a single institution's cluster. We need to be able to reproduce this environment in a portable manner. Over on the improved_Kd_fitting branch, I've done my best to create smaller environments for the pipeline - which works great for the Python, but is an absolute pain for the R. I'm continuing to try to build an environment that works, but I'm unfamiliar enough with R that things are frustratingly trial-and-error at the moment. I'm almost tempted to rewrite the R code in Python. That would certainly make @wsdewitt 's approach easier to implement in a cohesive manner with the rest of the analysis (e.g. expression/PSR).

wsdewitt commented 2 months ago

Thanks, Jared! The README notes

Note the sequencing data is not included in this repository and thus the processing of the ccs and the variants counts step are not able to be run from the data

so can we configure snakemake to raise an error if the data paths are not supplied (in a config file or on the command line)? Additionally, we could add a flag (call it "raw") to run the pipeline starting from the raw sequencing files, whereas without this flag, the pipeline takes count_variants.md and variant_counts.csv as input. I believe this could be done with conditional statements for these files (as either rule outputs or paths to existing files). If the raw data is huge, I think it's normal not to include it in the repo (but it should be made available somewhere).
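
Roughly, I'm imagining something like this at the top of the Snakefile (a sketch only: the config key, include path, and error text are made up):

```
# illustrative sketch -- config key and file layout are hypothetical
import os

configfile: "config.yaml"

if config.get("raw", False):
    # run from the raw sequencing files: define the upstream rules,
    # whose outputs include results/summary/variant_counts.csv
    include: "rules/upstream.smk"
else:
    # start from the tracked intermediates; fail early and loudly
    for f in ["results/summary/variant_counts.csv",
              "results/summary/count_variants.md"]:
        if not os.path.exists(f):
            raise ValueError(
                f"{f} is missing; restore it from git, or rerun with "
                "--config raw=True and the sequencing data available"
            )
```

(A from-scratch run would then be invoked with `snakemake --config raw=True`.)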

You mentioned the snakemake version is very old—should we upgrade to the latest?

Re R environment, should we specify a conda env with R (https://docs.anaconda.com/working-with-conda/packages/using-r-language/)?

jgallowa07 commented 2 months ago

Thanks, Jared! The README notes

Yes - I added this to the README on our topic branch.

so can we configure snakemake to raise an error if the data paths are not supplied (in a config file or on the command line)? Additionally, we could add a flag (call it "raw") to run the pipeline starting from the raw sequencing files, whereas without this flag, the pipeline takes count_variants.md and variant_counts.csv as input.

Yea this seems reasonable to me. Happy to do so.

You mentioned the snakemake version is very old—should we upgrade to the latest?

Sure - happy to do so! Hopefully the syntax and API are backwards compatible - but I think snakemake generally tries hard to make that the case.

Re R environment, should we specify a conda env with R (https://docs.anaconda.com/working-with-conda/packages/using-r-language/)?

I've already refactored the pipeline to use environments defined here, but currently I'm just spending a lot of time debugging the R environment. For example, the latest problem I've run into is:

Quitting from lines 15-36 (compute_binding_Kd.Rmd)
Error in contrib.url(repos, type) :
  trying to use CRAN without setting a mirror
Calls: <Anonymous> ... eval -> install.packages -> startsWith -> contrib.url

Execution halted

Note that this specific problem arises in the old Kd fitting steps: judging from the traceback, the Rmd calls install.packages() at runtime, which fails in a non-interactive session unless a CRAN mirror has been set (e.g. via options(repos = ...)). I could go ahead and hack that step out of the pipeline, but then I'd have to modify all the downstream R code. I will keep trying to make the R code work with conda.

Once it does -- I'll likely feel most comfortable if we containerize things like this.
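
For reference, recent versions of snakemake have built-in support for running jobs in containers, which is roughly what I have in mind (the image below is a placeholder, not something we've built or published):

```
# sketch only -- image name is a placeholder
# a global container directive; running with
#   snakemake --use-singularity --use-conda
# then builds each rule's conda env inside this image,
# pinning the OS layer as well as the packages
container: "docker://condaforge/mambaforge:latest"
```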

wsdewitt commented 2 months ago

I do think we want to remove compute_binding_Kd.Rmd for this PR, since it's superseded by your new rule.

jgallowa07 commented 2 months ago

Yea, that will require modification of the downstream R, obviously, but I agree. I think I've finally solved the environment stuff so hopefully the rest goes fairly quickly.