As is common in bioinformatics, programs are not ready to go from the get-go. UPPMAX no longer updates the nf-core module. Hence, the nf-core/mag pipeline is at 2.3.2 while the latest is 3.0.0, and nf-core tools is at 2.6 while the latest is 2.14.1. Furthermore, Nextflow is also outdated, as I got this error: `Nextflow version 22.10.2 does not match workflow required version: >=23.04.0`. I noticed this when I tried running the test profile and ran into errors that were fixed in later versions.
To fix this, I created an nf-core mamba environment and installed Nextflow and nf-core tools.
```bash
mamba create -n nf-core python=3.12 nf-core nextflow
```
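A quick sanity check that the new environment provides current versions (plain CLI calls, nothing pipeline-specific):

```bash
# Verify the environment ships recent tooling
nextflow -version
nf-core --version
```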
Now, with the `.bashrc` edits (#37), I activated the environment and ran the test profile together with the uppmax profile; SUCCESS!
```bash
mamba activate nf-core
mkdir nf-core-mag-test
cd nf-core-mag-test

# Run the workflow
nextflow run nf-core/mag -profile test,uppmax --project naiss2024-5-1 --outdir .
```
It is now running and creating SLURM jobs.
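To keep an eye on the jobs it submits, standard SLURM commands work (nothing nf-core specific):

```bash
# List my queued and running jobs on the cluster
squeue -u $USER
```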
This is the code I ran to get the errors mentioned above.
```bash
module load nf-core/latest nf-core-pipelines/latest

# Path to where all pipelines are stored
echo $NF_CORE_PIPELINES
/sw/bioinfo/nf-core-pipelines/latest/rackham

tree -L 2 $NF_CORE_PIPELINES -I 'singularity_cache_dir'
├── mag
│   ├── 1.0.0
│   ├── 1.1.0
│   ├── 1.1.1
│   ├── 1.1.2
│   ├── 1.2.0
│   ├── 2.0.0
│   ├── 2.1.0
│   ├── 2.1.1
│   ├── 2.2.0
│   ├── 2.2.1
│   ├── 2.3.0
│   ├── 2.3.1
│   ├── 2.3.2
│   └── dev
```
Test of the pipeline
```bash
# Create output dir
mkdir nf-core-mag-test

# Load the modules
module load bioinfo-tools Nextflow/22.10.1 nf-core-pipelines/latest

# Set the variable for NXF_HOME
export NXF_HOME=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.nextflow

# Let's give it a test
nextflow run $NF_CORE_PIPELINES/mag/2.3.2/workflow -profile test,uppmax --project naiss2024-5-1 --outdir nf-core-mag-test/
```
```
N E X T F L O W ~ version 22.10.2
Launching `/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/main.nf` [exotic_venter] DSL2 - revision: 8f9adbafb4
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/mag v2.3.2
------------------------------------------------------
Core Nextflow options
runName : exotic_venter
containerEngine : singularity
launchDir : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS
workDir : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/work
projectDir : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow
userName : andbou
profile : test,uppmax
configFiles : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/nextflow.config
Input/output options
input : https://raw.githubusercontent.com/nf-core/test-datasets/mag/samplesheets/samplesheet.csv
outdir : nf-core-mag-test/
Reference genome options
igenomes_base : /sw/data/igenomes/
Institutional config options
custom_config_base : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/../configs/
config_profile_name : Test profile
config_profile_description : Minimal test dataset to check pipeline function
config_profile_contact : Phil Ewels (@ewels)
config_profile_url : https://www.uppmax.uu.se/
Max job request options
max_cpus : 2
max_memory : 6.GB
max_time : 6.h
Quality control for short reads options
phix_reference : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_002596845.1_ASM259684v1_genomic.fna.gz
Quality control for long reads options
lambda_reference : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_000840245.1_ViralProj14204_genomic.fna.gz
Taxonomic profiling options
centrifuge_db : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_cf.tar.gz
kraken2_db : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_kraken.tgz
skip_krona : true
gtdb : false
Binning options
skip_concoct : true
min_length_unbinned_contigs: 1
max_unbinned_contigs : 2
Bin quality check options
busco_reference : https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
busco_clean : true
checkm_db : null
gunc_db : null
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/mag for your analysis please cite:
* The pipeline publication
https://doi.org/10.1093/nargab/lqac007
* The pipeline
https://doi.org/10.5281/zenodo.3589527
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/mag/blob/master/CITATIONS.md
------------------------------------------------------
WARN: Found unexpected parameters:
* --save_reference: true
- Ignore this warning: params.schema_ignore_params = "save_reference"
Unable to read script: '/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/./workflows/mag.nf' -- cause: https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
-[nf-core/mag] Pipeline completed successfully, but with errored process(es) -
Completed at: 14-May-2024 12:22:36
Duration : 1h 6m 18s
CPU hours : 1.5 (0.2% failed)
Succeeded : 161
Ignored : 1
Failed : 1
```
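The failing process was the BUSCO lineage download (the `cause:` URL in the error above). A possible workaround, which is essentially what the local `busco_db` path further down amounts to, is to fetch and unpack the lineage manually and point the pipeline at it; a sketch, with the URL taken from the log:

```bash
# Pre-download the BUSCO bacteria lineage so the pipeline does not have to
wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
mkdir -p busco-db
tar -xzf bacteria_odb10.2020-03-06.tar.gz -C busco-db
# Then pass e.g. --busco_db busco-db/bacteria_odb10/ to the pipeline
```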
As we are going to run this for only one sample (CHST), I will manually write these `.csv` files.
Note from https://nf-co.re/mag/3.0.0/docs/usage: since long reads are only used for assembly, any long-read FASTQ files listed in the reads CSV are ignored when an existing assembly is provided. Hence, I will not add long reads to the reads `.csv` file.
samplesheet-reads.csv
```csv
sample,group,short_reads_1,short_reads_2,long_reads
CHST,0,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R1-clean.fq.gz,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R2-clean.fq.gz,
```
samplesheet-assembly.csv
```csv
id,group,assembler,fasta
CHST,0,SPAdesHybrid,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/CHST/contigs.fasta
```
The parameters `.yml` file:
```yaml
input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
cat_db_generate: true
save_cat_db: true
busco_db: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/busco-db/bacteria_odb10/
save_busco_db: true
busco_clean: true
save_checkm_data: true
refine_bins_dastool: true
```
The command I ran:
```bash
nextflow run nf-core/mag -r 3.0.0 -bg -profile uppmax -params-file ../doc/chst-mag-params.yml --project naiss2024-5-1 > chst-bg.log
```
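Because of `-bg`, the run detaches from the terminal, so progress is easiest to follow through the redirected log:

```bash
# Follow the background run's output
tail -f chst-bg.log
```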
As it turned out after multiple runs, `busco` and `cat_db` failed, which caused the whole pipeline to fail. I could not resolve the error after checking the issues and the Slack channel. BUSCO and CAT can be run afterwards on the finished bins; see the sketch below.
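For the record, a standalone BUSCO run on a finished bin might look like this (a sketch assuming the BUSCO v5 CLI and a pre-downloaded lineage; the bin name is taken from the results below):

```bash
# Hypothetical post-hoc BUSCO run on one bin, in genome mode
busco -i SPAdesHybrid-MaxBin2-CHST.001.fa \
      -l bacteria_odb10 \
      -m genome \
      -o busco-CHST-001
```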
Here is the updated `.yaml` that rendered a successful run.
```yaml
input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
binqc_tool: checkm
save_checkm_data: true
refine_bins_dastool: true
```
`DASTool` was able to render three high-quality bins. We see they have high N50 values and good genome sizes, with two having completeness above 90%. However, bins from `CONCOCT` were not handled, as it produces bins in a different format from `MaxBin2` and `MetaBAT2`.
```csv
bin,bin_set,unique_SCGs,redundant_SCGs,SCG_set,size,contigs,N50,bin_score,SCG_completeness,SCG_redundancy
SPAdesHybrid-MaxBin2Refined-CHST.001,SPAdesHybrid-MaxBin2-CHST.tsv,50,0,bacteria,2245518,136,145436,0.980392156862745,98,0
SPAdesHybrid-MetaBAT2Refined-CHST.3,SPAdesHybrid-MetaBAT2-CHST.tsv,46,0,bacteria,1082364,113,16567,0.901960784313725,90,0
SPAdesHybrid-MaxBin2Refined-CHST.003_sub,SPAdesHybrid-MaxBin2-CHST.tsv,32,2,bacteria,2066527,373,15407,0.570343137254902,63,4
```
From the GTDB-Tk taxonomy classification, three bins were classified: one belonging to Erwinia and two to Wolbachia pipientis.
```
SPAdesHybrid-MaxBin2-CHST.001.fa,d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia;s
SPAdesHybrid-MaxBin2-CHST.002.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis
SPAdesHybrid-MetaBAT2-CHST.3.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis
```
In the COGE run, only one bin was found.
```csv
bin,bin_set,unique_SCGs,redundant_SCGs,SCG_set,size,contigs,N50,bin_score,SCG_completeness,SCG_redundancy
SPAdes-MetaBAT2Refined-COGE.3,SPAdes-MetaBAT2-COGE.tsv,50,0,bacteria,1447577,51,72858,0.980392156862745,98,0
```
It was classified as an Erwinia sp.
```
SPAdes-MetaBAT2-COGE.3.fa d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia
```
Next, I will give `anvio` a go to get more control over each step, but it is fun getting some results!
Things to look into:
- `CheckM2` for refinement (better for novel species, uses machine learning); see the sketch below.
- Handling the `CONCOCT` bins.
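A minimal sketch of the CheckM2 idea, assuming the CheckM2 `predict` CLI and that its database has already been fetched with `checkm2 database --download`; the folder name is hypothetical:

```bash
# Hypothetical CheckM2 quality check over a folder of bins
checkm2 predict \
    --input bins/ \
    --extension fa \
    --threads 8 \
    --output-directory checkm2-out
```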
Input specifications
To skip the QC and assembly steps, we can provide the assemblies in a CSV file, together with a CSV file specifying the reads used to generate them. The assembly CSV is passed with `--assembly_input` and the reads CSV with `--input`. The assembly CSV must contain the columns `id,group,assembler,fasta`, with one of these values for assembler: MEGAHIT, SPAdes, or SPAdesHybrid. The reads CSV must contain the columns `sample,group,short_reads_1,short_reads_2,long_reads`. Note that both files must contain the headers.
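Minimal example samplesheets, with hypothetical paths (the CHST files earlier in this issue are real instances):

Example: samplesheet-assembly.csv
```csv
id,group,assembler,fasta
sample1,0,SPAdes,/path/to/sample1/contigs.fasta
```

Example: samplesheet-reads.csv
```csv
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,/path/to/sample1_R1.fq.gz,/path/to/sample1_R2.fq.gz,
```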
Running the pipeline
Example command:
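Patterned on the CHST run above (the project ID, params file, and outdir are site-specific):

```bash
# Run a pinned pipeline version with a parameters file, as in the CHST run
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --project naiss2024-5-1 \
    --outdir results/
```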
Updating the pipeline
Nextflow automatically pulls the pipeline code and caches it locally. To be sure you are running the latest version of the pipeline, regularly update the cached copy:
```bash
nextflow pull nf-core/mag
```
Reproducibility
- Specify the pipeline version using `-r version`. This will be logged in reports, so one can always check which versions of the different tools were used.
- `MetaBAT2` is run by default with a fixed seed set in the pipeline.
- To keep `BUSCO` reproducible, download a lineage beforehand, as lineages are frequently updated and old versions are not always easily accessible.
- For the taxonomic bin classification with CAT, when running the pipeline with `--cat_db_generate`, the parameter `--save_cat_db` can be used to also save the generated database to allow reproducibility in future runs. Note that when specifying a pre-built database with `--cat_db`, the database currently cannot be saved.
- When it comes to visualizing taxonomic data using Krona, you have the option to provide a taxonomy file, such as taxonomy.tab, using the `--krona_db` parameter (see the sketch after this list). If you don't supply a taxonomy file, Krona is designed to automatically download the required taxonomy data for visualization.
- The taxonomic classification of bins with GTDB-Tk is not guaranteed to be reproducible, since the placement of bins in the reference tree is non-deterministic. However, the authors of the GTDB-Tk article examined the reproducibility on a set of 100 genomes across 50 trials and did not observe any difference (see https://doi.org/10.1093/bioinformatics/btz848).
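For the Krona point, passing a pre-downloaded taxonomy file would look something like this (a sketch; `--krona_db` as named in the note above, the taxonomy path hypothetical, other arguments as in the example command earlier):

```bash
# Supply a local Krona taxonomy instead of letting Krona download one
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --krona_db /path/to/taxonomy.tab
```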
Note on bin refinement
`DAS Tool` may not work if not enough single-copy genes are recovered. One can change the scoring threshold using `--refine_bins_dastool_threshold`, which will modify the scoring threshold defined in the DAS Tool publication.
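For instance, relaxing the threshold while keeping refinement enabled (parameter names as used in the params file above; the value 0.4 is purely illustrative):

```bash
# Refine bins with DAS Tool using a lower, illustrative score threshold
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --refine_bins_dastool \
    --refine_bins_dastool_threshold 0.4
```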