Open PatrickMaclean opened 1 year ago
Weird! When you look in the work directory for the BismarkIndex task, does the Fasta file look normal there?
I don't think that this is a bug in this pipeline particularly; it sounds to me more likely to be a core-Nextflow problem.
Does changing stageInMode have any effect? (docs)
Same bug for me this morning when I ran it locally with Docker.
Reproducible example:
OS: Ubuntu LTS 20.03 Ubuntu 22.04.2 LTS (GNU/Linux 5.19.0-32-generic x86_64)
nextflow version 23.04.1.5866
nf-core/methylseq v2.3.0-g93bc581
Executor: Local
Container engine: Docker version 23.0.3, build 3e7cbfd
Dataset : SRR10532131
nextflow run nf-core/methylseq \
-profile docker \
--outdir out \
--max_cpus 12 \
--input ./samplesheet.csv \
--max_memory 30GB \
--fasta ./GRCh38_latest_genomic.fa \
-bg \
-with-report
Note: The reason I used --fasta ./GRCh38_latest_genomic.fa in the first place is that the pipeline could not download GRCh38 from Amazon AWS, which may be another bug. Could that also explain why the bug has not been spotted earlier?
EDIT:
Is it possible that the samplesheet.csv overrides a parameter somehow?
Mine looked like this:
sample,single_end,fastq_1,fastq_2,genome
unknown_patient,False,/home/delevoye/work_ccmc/SRR10532131_1.fastq.gz,/home/delevoye/work_ccmc/SRR10532131_2.fastq.gz,GRCh38
Thanks for posting a reproducible example - sorry Phil I missed your initial response. I've certainly seen the bug with the standard 3 column samplesheet (sample, fastq_1, fastq_2).
@PatrickMaclean Could it be linked with nf-core/methylseq#637 somehow?
@ewels How do you change the stageInMode? The docs only say what the different modes are, not how to change them.
Tbh I'm a bit confused/overwhelmed between the 6 ways I've found online to specify a custom genome:
Not sure I understand what is the point of all these options
It seems like there are priorities of some kind between these different options, but the hierarchy is not clear to me.
Hi @GDelevoye,
Ok, here goes:
config profiles, config files, command-line parameter --fasta
These three are the same thing, just different ways of setting the params.fasta config variable. See docs here and here.
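To make the equivalence concrete, here is a minimal sketch that writes a custom config setting params.fasta and shows, in comments, the matching command-line form. The file name my_genome.config and the path /path/to/genome.fa are placeholders chosen for illustration, not anything shipped with the pipeline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name and FASTA path, used only for illustration.
cat > my_genome.config <<'EOF'
params {
    fasta = '/path/to/genome.fa'
}
EOF

# Config-file route: hand the file to Nextflow with -c.
#   nextflow run nf-core/methylseq -profile docker -c my_genome.config \
#       --input samplesheet.csv --outdir out
#
# Command-line route: set the same parameter directly.
#   nextflow run nf-core/methylseq -profile docker --fasta /path/to/genome.fa \
#       --input samplesheet.csv --outdir out

cat my_genome.config
```

If both are given, a parameter passed on the command line should take precedence over the value in the config file, which is one source of the "priorities" between these options.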
igenome server, Refgenie
These are two different ways of getting pre-built Bismark reference genome indices into Nextflow.
AWS-iGenomes is simply a bucket on s3 with a load of common refs + indices hosted, and nf-core pipelines ship with a config that points to these locations. So if you do --genome GRCh37 then params.fasta (and params.bismark_index and a bunch of others) will be automatically set to point to the s3 path for that reference.
RefGenie is a similar thing which we are hoping to migrate to, with assets on s3. It's also a standalone CLI tool for managing local reference genomes. We have a plugin in the nf-core CLI that talks to the refgenie CLI tool to automatically generate Nextflow config files for you (docs). It ends up being the same as iGenomes, where you do --genome my_ref and params.fasta is set to the correct local path, managed by Refgenie.
samplesheet.csv...
This is different to everything else. Earlier versions of the pipeline included the option to specify a different reference genome for every sample. Then this was dropped, but we want to add it back again (I don't think that's been done yet.... right? Ties into the probable bug described here).
It'll end up being functionally the same as the stuff above, but defined on a per-sample basis if you want it to be.
This is all usage questions really, most of this issue relates to what is likely a specific bug in particular usage of the above.
@PatrickMaclean could you share how you have your sample-sheet formatted?
Just the minimum columns - sample, fastq_1, fastq_2 - haven't ever specified a genome on the sample sheet.
Pretty sure that the igenomes thing is a red herring (likely the AWS download didn't work because it's in Europe and you're in a different region - this is a known limitation). If you're using --fasta then the pipeline should be generating its own index and have zero dependence on anything to do with igenomes.
Also pretty sure that sample-sheet related things are a red-herring. Fields are loaded by header column title, so order and presence of non-required columns should have no effect.
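As an illustration of what "loaded by header column title" means in practice, the sketch below looks up a column by its header name with awk, so that column order and extra columns do not change the result. This is not the pipeline's own parsing code; csv_column and the two demo sheets are hypothetical, and the helper assumes a simple CSV with no quoted commas.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the value of a named column for every data row of a simple CSV.
# Illustration only: no support for quoted fields containing commas.
csv_column() {
    local file="$1" column="$2"
    awk -F',' -v col="$column" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx     { print $idx }
    ' "$file"
}

# Same data, different column order and extra columns.
printf 'sample,fastq_1,fastq_2\nS1,a_1.fq.gz,a_2.fq.gz\n' > sheet_minimal.csv
printf 'sample,single_end,fastq_2,fastq_1,genome\nS1,False,a_2.fq.gz,a_1.fq.gz,GRCh38\n' > sheet_extra.csv

csv_column sheet_minimal.csv fastq_1
csv_column sheet_extra.csv fastq_1
```

Both calls resolve fastq_1 to the same file name even though it sits in a different position in each sheet.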
This is the key question I'm missing:
When you look in the work directory for the BismarkIndex task, does the Fasta file look normal there?
Basically, trying to isolate the stage at which this fasta file is being truncated. Is it the bismark index generation step where it's getting messed up, or is it after that - somewhere in the process of staging that file as an input for the alignment process.
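One way to answer that question from the command line is to classify the staged FASTA in each task's work directory. The helper below is a hypothetical sketch (check_staged_file is not part of Nextflow or the pipeline); it distinguishes a dangling symlink, a missing file, an empty file and a non-empty file, and demonstrates itself on throwaway files in a temporary directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Classify a staged file: dangling symlink, missing, empty, or ok.
# Hypothetical helper for illustration only.
check_staged_file() {
    local path="$1"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        echo "dangling-symlink"      # link exists but its target does not
    elif [ ! -e "$path" ]; then
        echo "missing"
    elif [ ! -s "$path" ]; then
        echo "empty"                 # present (possibly via a link) but zero bytes
    else
        echo "ok"
    fi
}

# Self-contained demo on throwaway files.
demo_dir="$(mktemp -d)"
printf '>chr1\nACGT\n' > "$demo_dir/good.fa"
: > "$demo_dir/empty.fa"
ln -s "$demo_dir/does_not_exist.fa" "$demo_dir/dangling.fa"

for f in good.fa empty.fa dangling.fa absent.fa; do
    printf '%s\t%s\n' "$f" "$(check_staged_file "$demo_dir/$f")"
done
```

In a real run you would point it at the FASTA inside the BismarkIndex/ folder of the genome preparation task's work directory, and again at the same path as seen from the failing alignment task, to see at which stage the file stops looking normal.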
Sometimes weird bugs can happen where a process incorrectly tries to write over a file - if it's hardlinked into the work directory then this can also overwrite the original source file. This is why I was wondering about stageInMode, as copying it into the work directory is safer - should help again to isolate whether it was ok in the BismarkIndex process or not.
stageInMode is a process-level directive, so it's process.stageInMode = 'copy' to make it apply to all processes in the pipeline.
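A minimal sketch of applying that setting: write the line to a small config file and pass it to the run with -c. The file name stage_copy.config is an arbitrary choice, and the nextflow command is shown only as a comment so the snippet runs without Nextflow installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name; any path passed to -c should work.
cat > stage_copy.config <<'EOF'
// Copy inputs into each task's work directory instead of linking them.
process.stageInMode = 'copy'
EOF

# Then add the file to the run, for example:
#   nextflow run nf-core/methylseq -profile docker -c stage_copy.config \
#       --fasta ./genome.fa --input ./samplesheet.csv --outdir out

cat stage_copy.config
```

Copying a full human genome and large FASTQ files into every task directory costs disk space and time, so this is probably best treated as a debugging switch rather than a permanent setting.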
Ok, trying to replicate this.
Fetching the data, links from https://sra-explorer.info/
#!/usr/bin/env bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/031/SRR10532131/SRR10532131_1.fastq.gz -o SRR10532131_EM-seq_10_ng_replicate_1_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/031/SRR10532131/SRR10532131_2.fastq.gz -o SRR10532131_EM-seq_10_ng_replicate_1_2.fastq.gz
Wasn't sure exactly where your reference fasta came from, so downloaded a GRCh38 fasta file from AWS-iGenomes:
wget https://ngi-igenomes.s3.eu-west-1.amazonaws.com/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
Ended up with this:
$ ls -lh
total 18G
-rw-r--r-- 1 gitpod gitpod 300 May 3 20:52 download.sh
-rw-r--r-- 1 gitpod gitpod 3.0G Apr 13 2017 genome.fa
-rw-r--r-- 1 gitpod gitpod 166 May 3 21:00 samplesheet.csv
-rw-r--r-- 1 gitpod gitpod 7.2G May 3 20:56 SRR10532131_EM-seq_10_ng_replicate_1_1.fastq.gz
-rw-r--r-- 1 gitpod gitpod 7.9G May 3 21:00 SRR10532131_EM-seq_10_ng_replicate_1_2.fastq.gz
Based on the above, which has unused columns (single_end, genome) but replicating as closely as I can:
sample,single_end,fastq_1,fastq_2,genome
unknown_patient,False,SRR10532131_EM-seq_10_ng_replicate_1_1.fastq.gz,SRR10532131_EM-seq_10_ng_replicate_1_2.fastq.gz,GRCh38
Running on Gitpod, but tried to keep it basically the same:
nextflow run nf-core/methylseq \
-profile docker \
--outdir out \
--max_cpus 4 \
--input ./samplesheet.csv \
--max_memory 8GB \
--fasta ./genome.fa \
-with-report
....and, GitPod is going to take about 100 years to run indexing on a full genome + processing a full dataset :/
Come back soon to hear the exciting finale! (or probably not, just that it wouldn't run on GitPod resources)..
(bonus points if anyone can replicate using -profile test, which uses a tiny reference genome)
Hi,
I'm having the same issue while trying to run a test using sample E. coli data. I was able to get the pipeline to complete successfully using --aligner bwameth; however, I am getting the following error while invoking --aligner bismark.
executor > local (4)
[53/c12726] process > NFCORE_METHYLSEQ:METHYLSEQ:PREPARE_GENOME:BISMARK_GENOMEPREPARATION (BismarkIndex/GCF_000005845.2_ASM584v2_genomic.fasta) [100%] 1 of 1 ✔
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:CAT_FASTQ -
[eb/0d76aa] process > NFCORE_METHYLSEQ:METHYLSEQ:FASTQC (Ecoli_10K) [100%] 1 of 1 ✔
[cc/e2d00d] process > NFCORE_METHYLSEQ:METHYLSEQ:TRIMGALORE (Ecoli_10K) [100%] 1 of 1 ✔
[84/9a5ddd] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (Ecoli_10K) [100%] 1 of 1, failed: 1 ✘
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:SAMTOOLS_SORT_ALIGNED -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_DEDUPLICATE -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_METHYLATIONEXTRACTOR -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_REPORT -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_SUMMARY [ 0%] 0 of 1
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:SAMTOOLS_SORT_DEDUPLICATED -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:QUALIMAP_BAMQC -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:PRESEQ_LCEXTRAP -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:CUSTOM_DUMPSOFTWAREVERSIONS -
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:MULTIQC -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/methylseq] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (Ecoli_10K)'
Caused by:
Process `NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (Ecoli_10K)` terminated with an error exit status (2)
Command executed:
bismark \
-1 Ecoli_10K_1_val_1.fq.gz -2 Ecoli_10K_2_val_2.fq.gz \
--genome BismarkIndex \
--bam \
--bowtie2 --multicore 4
cat <<-END_VERSIONS > versions.yml
"NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN":
bismark: $(echo $(bismark -v 2>&1) | sed 's/^.*Bismark Version: v//; s/Copyright.*$//')
END_VERSIONS
Command exit status:
2
Command output:
(empty)
Command error:
Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.5])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is BismarkIndex/ (absolute path is '/mnt/raid/gfilloramo/methyl-seq/test_data/work/53/c1272690a523e4b0000f18c7840fcb/BismarkIndex/)'
FastQ format assumed (by default)
Input files to be analysed (in current folder '/mnt/raid/gfilloramo/methyl-seq/test_data/work/84/9a5ddd9fa3382410e11e0028933e22'):
Ecoli_10K_1_val_1.fq.gz
Ecoli_10K_2_val_2.fq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Running Bismark Parallel version. Number of parallel instances to be spawned: 4
Current working directory is: /mnt/raid/gfilloramo/methyl-seq/test_data/work/84/9a5ddd9fa3382410e11e0028933e22
Now reading in and storing sequence information of the genome specified in: /mnt/raid/gfilloramo/methyl-seq/test_data/work/53/c1272690a523e4b0000f18c7840fcb/BismarkIndex/
Failed to read from sequence file GCF_000005845.2_ASM584v2_genomic.fasta No such file or directory
My files are:
R1= Ecoli_10K_methylated_R1.fastq.gz
R2= Ecoli_10K_methylated_R2.fastq.gz
reference genome= GCF_000005845.2_ASM584v2_genomic.fasta
My sample sheet is attached samplesheet_test_GVFedit.csv
The command I'm using is:
nextflow run nf-core/methylseq -profile docker --input samplesheet_test_GVFedit.csv --outdir test_results_bisbt2 --fasta GCF_000005845.2_ASM584v2_genomic.fasta --fasta_index GCF_000005845.2_ASM584v2_genomic.fasta.fai –aligner bismark –save_trimmed --save_align_intermeds
Thanks in advance!
Hi,
I have a similar problem: the pipeline doesn't find the reference genome, although the genome is there and the file is not empty. This happens with both --aligner bismark and bwameth.
$ nextflow run nf-core/methylseq --input /scratch/project_2010912/hannu/sample_list_test.csv --fasta /scratch/project_2010912/hannu/salmon_major_chromosomes.fasta --save_reference --outdir /scratch/project_2010912/methylseq_output --multiqc_title test_report -profile singularity -resume
Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: 2.
The full error message was:
Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (14)'
Caused by:
Process `NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (14)` terminated with an error exit status (2)
Command executed:
bismark \
-1 14_1_val_1.fq.gz -2 14_2_val_2.fq.gz \
--genome BismarkIndex \
--bam \
--bowtie2 --multicore 4
cat <<-END_VERSIONS > versions.yml
"NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN":
bismark: $(echo $(bismark -v 2>&1) | sed 's/^.*Bismark Version: v//; s/Copyright.*$//')
END_VERSIONS
Command exit status:
2
Command output:
(empty)
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.5])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is BismarkIndex/ (absolute path is '/scratch/project_2010912/hannu/work/a3/eb731a5587a70df3a2881d8e50a37b/BismarkIndex/)'
FastQ format assumed (by default)
Input files to be analysed (in current folder '/scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69'):
14_1_val_1.fq.gz
14_2_val_2.fq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Running Bismark Parallel version. Number of parallel instances to be spawned: 4
Current working directory is: /scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69
Now reading in and storing sequence information of the genome specified in: /scratch/project_2010912/hannu/work/a3/eb731a5587a70df3a2881d8e50a37b/BismarkIndex/
Failed to read from sequence file salmon_major_chromosomes.fasta No such file or directory
Work dir:
/scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
$ nextflow run nf-core/methylseq --input /scratch/project_2010912/hannu/sample_list_test.csv --fasta /scratch/project_2010912/hannu/salmon_major_chromosomes.fasta --save_reference --outdir /scratch/project_2010912/methylseq_output --multiqc_title test_report --aligner bwameth
Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: null.
The full error message was:
Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:PREPARE_GENOME:SAMTOOLS_FAIDX'
Caused by:
Not a valid path value type: groovyx.gpars.dataflow.DataflowVariable (DataflowVariable(value=/scratch/project_2010912/hannu/salmon_major_chromosomes.fasta))
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
The second bwameth run in https://github.com/nf-core/methylseq/issues/305#issuecomment-2275391076 seems to be throwing a different error. I think that one is unrelated (but should also be looked into 🙈).
That bug with bwameth has been fixed and the pipeline now runs successfully:
nextflow run main.nf -profile test,docker --outdir results --aligner bwameth --save_reference true --fasta genome.fa --fasta_index genome.fa.fai
------------------------------------------------------
executor > local (54)
[24/3632c2] process > NFCORE_METHYLSEQ:PREPARE_GENOME:BWAMETH_INDEX (bwameth/genome.fa) [100%] 1 of 1 ✔
[- ] process > NFCORE_METHYLSEQ:METHYLSEQ:CAT_FASTQ -
[b0/c071bd] process > NFCORE_METHYLSEQ:METHYLSEQ:FASTQC (SRR389222_sub2) [100%] 4 of 4 ✔
[24/247a53] process > NFCORE_METHYLSEQ:METHYLSEQ:TRIMGALORE (Ecoli_10K_methylated) [100%] 4 of 4 ✔
[6c/b6e75d] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:BWAMETH_ALIGN (Ecoli_10K_methylated) [100%] 4 of 4 ✔
[43/373773] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:SAMTOOLS_SORT (SRR389222_sub1) [100%] 4 of 4 ✔
[60/68cf53] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:SAMTOOLS_INDEX_ALIGNMENTS (SRR389222_sub1) [100%] 4 of 4 ✔
[55/796b13] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:SAMTOOLS_FLAGSTAT (Ecoli_10K_methylated) [100%] 4 of 4 ✔
[21/57d8b3] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:SAMTOOLS_STATS (SRR389222_sub1) [100%] 4 of 4 ✔
[81/cc9971] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:PICARD_MARKDUPLICATES (SRR389222_sub3) [100%] 4 of 4 ✔
[9d/260d0d] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:SAMTOOLS_INDEX_DEDUPLICATED (SRR389222_sub3) [100%] 4 of 4 ✔
[52/b2c091] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:METHYLDACKEL_EXTRACT (SRR389222_sub3) [100%] 4 of 4 ✔
[9e/3c93b3] process > NFCORE_METHYLSEQ:METHYLSEQ:BWAMETH:METHYLDACKEL_MBIAS (SRR389222_sub3) [100%] 4 of 4 ✔
[55/da17da] process > NFCORE_METHYLSEQ:METHYLSEQ:QUALIMAP_BAMQC (Ecoli_10K_methylated) [100%] 4 of 4 ✔
[f9/5315b1] process > NFCORE_METHYLSEQ:METHYLSEQ:PRESEQ_LCEXTRAP (SRR389222_sub2) [100%] 4 of 4, failed: 4 ✔
[4f/05f969] process > NFCORE_METHYLSEQ:METHYLSEQ:MULTIQC [100%] 1 of 1 ✔
-[nf-core/methylseq] Pipeline completed successfully, but with errored process(es) -
Completed at: 03-Oct-2024 05:00:08
Duration : 3m 19s
CPU hours : 0.1 (6.6% failed)
Succeeded : 50
Ignored : 4
Failed : 4
Description of the bug
Hi - thanks for all your work on the pipeline.
I have a recurrent issue which is easy to overcome but I presume is caused by a bug somewhere. When I specify a .fasta reference using --fasta, I'm finding that the pipeline fails at the beginning of alignment because the generated Bismark index contains an empty version of the supplied .fasta file - see error message below:
Inside the work dir, BismarkIndex/ contains a correctly named, empty .fasta. I overcome this by copying the original .fasta into the working directory and restarting the pipeline with -resume. Not a major issue but it took me a little while to figure it out.
Thanks
Patrick
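The workaround described above can be sketched as a small shell helper. restore_fasta is a hypothetical name, and the sketch assumes the empty file sits directly inside the BismarkIndex/ folder under the same name as the source FASTA; it demonstrates itself on throwaway files in a temporary directory rather than on a real work directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Replace an empty (or dangling) staged FASTA with a real copy of the source.
# Arguments: source FASTA, BismarkIndex directory. Hypothetical helper.
restore_fasta() {
    local source_fasta="$1" index_dir="$2"
    local target
    target="$index_dir/$(basename "$source_fasta")"
    if [ -s "$target" ]; then
        echo "unchanged"             # already non-empty, leave it alone
        return 0
    fi
    # Remove first so a link is replaced rather than written through.
    rm -f "$target"
    cp "$source_fasta" "$target"
    echo "restored"
}

# Self-contained demo on throwaway files.
demo="$(mktemp -d)"
mkdir "$demo/BismarkIndex"
printf '>chr1\nACGT\n' > "$demo/genome.fasta"
: > "$demo/BismarkIndex/genome.fasta"

restore_fasta "$demo/genome.fasta" "$demo/BismarkIndex"
```

After restoring the file in the real work directory, the run would be restarted with the original command plus -resume, as in the description.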
Command used and terminal output
System information
Methylseq: V2.3.0
Script name: main.nf
Script ID: d420d96c87e85cb9eb0749a6d4f01610
Workflow session: 2fbd7f38-7eaf-41c1-967c-8bdef68bdc9d
Workflow profile: standard
Nextflow version: version 22.10.1, build 5828 (27-10-2022 16:58 UTC)
Executor: Slurm
Container engine: Singularity
OS: Unix