snakemake-workflows / rna-seq-kallisto-sleuth

A Snakemake workflow for differential expression analysis of RNA-seq data with Kallisto and Sleuth.
MIT License

cutadapt not available? #84

Closed: him1532 closed this issue 11 months ago

him1532 commented 11 months ago

Hi, thank you for building a nice tool and pipeline. I am a first-time user of Snakemake, and I followed the documentation to install it.

My sample.tsv looks like this:

```
sample  condition  batch_effect
EOL-1   DMSO       batch1
EOL-1   GSK        batch1
```

My units.tsv looks like this:

```
sample  unit  fragment_len_mean  fragment_len_sd  fq1                                                fq2
EOL-1   1     NA                 NA               /WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_1.fq.gz  /NAS2/WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_2.fq.gz
EOL-1   1     NA                 NA               /WTS/RNAseq_test_1D_231201_N12/EOL-1_GSK_1.fq.gz   /WTS/RNAseq_test_1D_231201_N12/EOL-1_GSK_2.fq.gz
```

When I run `snakemake --cluster qsub --jobs 1`, the following is the output:

```
Workflow defines that rule get_transcriptome is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule get_annotation is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule get_transcript_info is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule convert_pfam is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule calculate_cpat_hexamers is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule calculate_cpat_logit_model is eligible for caching between workflows (use the --cache argument to enable this).
Workflow defines that rule get_spia_db is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Conda environments: ignored
Singularity containers: ignored

Job stats:
job                               count
------------------------------  -------
all                                   1
compose_sample_sheet                  2
cutadapt_pe                           2
diffexp_datavzrd                      1
get_transcript_info                   1
get_transcriptome                     1
ihw_fdr_control                       3
kallisto_index                        1
kallisto_quant                        2
logcount_matrix                       1
plot_bootstrap                        1
plot_diffexp_heatmap                  1
plot_diffexp_pval_hist                3
plot_fragment_length_dist             2
plot_group_density                    1
plot_pca                              1
render_datavzrd_config_diffexp        1
sleuth_diffexp                        1
sleuth_init                           2
vega_volcano_plot                     1
total                                29

Select jobs to execute...

[Fri Dec 15 13:56:28 2023]
rule cutadapt_pe:
    input: /WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_1.fq.gz, /WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_2.fq.gz
    output: results/trimmed/EOL-1-1.1.fastq.gz, results/trimmed/EOL-1-1.2.fastq.gz, results/trimmed/EOL-1-1.qc.txt
    log: results/logs/cutadapt/EOL-1-1.log
    jobid: 4
    reason: Missing output files: results/trimmed/EOL-1-1.1.fastq.gz, results/trimmed/EOL-1-1.2.fastq.gz
    wildcards: sample=EOL-1, unit=1
    threads: 8
    resources: mem_mb=15400, mem_mib=14687, disk_mb=15400, disk_mib=14687, tmpdir=

Submitted job 4 with external jobid 'Your job 1509737 ("snakejob.cutadapt_pe.4.sh") has been submitted'.
[Fri Dec 15 13:56:48 2023]
Error in rule cutadapt_pe:
    jobid: 4
    input: /WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_1.fq.gz, /WTS/RNAseq_test_1D_231201_N12/EOL-1_DMSO_2.fq.gz
    output: results/trimmed/EOL-1-1.1.fastq.gz, results/trimmed/EOL-1-1.2.fastq.gz, results/trimmed/EOL-1-1.qc.txt
    log: results/logs/cutadapt/EOL-1-1.log (check log file(s) for error details)
    conda-env: /home/him/RNA/.snakemake/conda/b97fd9bce60732e534a403eba0f5c294
    cluster_jobid: Your job 1509737 ("snakejob.cutadapt_pe.4.sh") has been submitted

Error executing rule cutadapt_pe on cluster (jobid: 4, external: Your job 1509737 ("snakejob.cutadapt_pe.4.sh") has been submitted, jobscript: /home/him/RNA/.snakemake/tmp.s42vvmk3/snakejob.cutadapt_pe.4.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-12-15T135619.975453.snakemake.log
```

When I look at the log file, I see:

```
$ cat results/logs/cutadapt/EOL-1-1.log
/bin/bash: cutadapt: command not found
```

My question is: should I install cutadapt and the other tools myself?

Thank you.

dlaehnemann commented 11 months ago

I'm assuming this command you mentioned is the full command you are using to run the workflow, right?

```
snakemake --cluster qsub --jobs 1
```

This workflow (and I think most snakemake workflows out there) uses environment files defined via the conda: directive to install the necessary software with conda or mamba (which you most probably also used to install snakemake itself). However, you have to actively tell snakemake to use these environments by passing the --use-conda flag on the command line. I guess the error message could very well be improved for this failure case...
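So, keeping your original invocation, something like this should resolve the missing cutadapt (a sketch; adjust the cluster arguments to your scheduler setup):

```
snakemake --use-conda --cluster qsub --jobs 1
```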

I'm closing this issue, with the assumption that this is the problem. If it is something else, feel free to reopen this issue. Or open other issues if you come across further problems. Otherwise, happy analysis!

him1532 commented 10 months ago

Thank you for the advice. I tried the following command this time:

```
snakemake --use-conda --jobs 1 --cluster "qsub -V -b y -S /bin/bash"
```

and it is stepping through the 29 jobs. One question I had: what can I do so that I don't have to wait for all 29 steps to finish, and don't risk accidentally closing the terminal window? One idea was to put an "&" at the end of the command; would this work? Also, I am not quite clear on how this pipeline works. Is there a way to save the jobscripts being submitted to the cluster? Thank you.

dlaehnemann commented 10 months ago

If you want to keep snakemake running, for example on a server, and be able to log out, you should for example start it in a screen session (a plain "&" only backgrounds the process; it will still be terminated when your terminal session ends). Some useful links are here: https://koesterlab.github.io/data-science-for-bioinfo/servers/screen.html
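A minimal sketch of that pattern (the session name snakemake-run is arbitrary):

```
screen -S snakemake-run          # start a named session
snakemake --use-conda --jobs 1 --cluster "qsub -V -b y -S /bin/bash"
# detach with Ctrl-a d; snakemake keeps running on the server
screen -r snakemake-run          # reattach later to check progress
```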

Also, feel free to look around that knowledge base for further recommendations on snakemake and on bioinformatics more generally.

Also, as you are using a cluster system, you can probably increase the number of --jobs that are submitted in parallel. Whenever multiple samples can be handled in parallel, snakemake will automatically do that for you.
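For example (20 is an arbitrary value here; pick whatever your cluster policy allows):

```
snakemake --use-conda --jobs 20 --cluster "qsub -V -b y -S /bin/bash"
```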

And if you want to know more about what is actually run, you can look around the repository here. All the rules that are executed are in .smk files in the workflow/rules/ directory, and all scripts are in workflow/scripts/. And if a rule uses a wrapper: directive (for example the rule kallisto_quant:), you can look that wrapper up in the snakemake wrapper repository / docs (for example the kallisto/quant wrapper); a minimal sketch of such a rule follows below. These docs are versioned, and they list all dependencies of a wrapper and the actual code that gets executed.
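For illustration only, a rule using the wrapper: directive looks roughly like this (the file names and wrapper version here are hypothetical; the actual kallisto_quant rule in workflow/rules/ will differ):

```
rule kallisto_quant:
    input:
        # paired-end trimmed reads plus a prebuilt kallisto index
        fastq=["results/trimmed/EOL-1-1.1.fastq.gz", "results/trimmed/EOL-1-1.2.fastq.gz"],
        index="results/kallisto/transcripts.idx",
    output:
        directory("results/kallisto/EOL-1-1"),
    log:
        "results/logs/kallisto/quant/EOL-1-1.log",
    threads: 4
    wrapper:
        # version prefix pins the wrapper; snakemake fetches its code and conda env
        "v1.21.0/bio/kallisto/quant"
```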