As is common in bioinformatics, programs are not ready to go from the get-go. UPPMAX no longer updates the nf-core module. Hence, the nf-core/mag pipeline is at 2.3.2 while the latest is 3.0.0, and nf-core tools is at 2.6 while the latest is 2.14.1. Furthermore, Nextflow is also outdated, as I got this error: `Nextflow version 22.10.2 does not match workflow required version: >=23.04.0`. I noticed this when I tried running the test profile and ran into errors that were fixed in later versions.
To fix this, I created an nf-core mamba environment and installed Nextflow and nf-core tools.
```bash
mamba create -n nf-core python=3.12 nf-core nextflow
```
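A quick sanity check that the new environment provides current versions (plain CLI calls, nothing pipeline-specific):

```bash
# Verify the environment ships recent tooling
nextflow -version
nf-core --version
```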
Now, with the `.bashrc` edits (#37), I activated the environment and ran the test profile together with the uppmax profile; SUCCESS!
```bash
mamba activate nf-core
mkdir nf-core-mag-test
cd nf-core-mag-test

# Run the workflow
nextflow run nf-core/mag -profile test,uppmax --project naiss2024-5-1 --outdir .
```
It is now running and creating SLURM jobs.
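To keep an eye on the jobs it submits, standard SLURM commands work (nothing nf-core specific):

```bash
# List my queued and running jobs on the cluster
squeue -u $USER
```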
This is the code I ran to get the errors mentioned above.
```bash
module load nf-core/latest nf-core-pipelines/latest

# Path to where all pipelines are stored
echo $NF_CORE_PIPELINES
/sw/bioinfo/nf-core-pipelines/latest/rackham

tree -L 2 $NF_CORE_PIPELINES -I 'singularity_cache_dir'
├── mag
│   ├── 1.0.0
│   ├── 1.1.0
│   ├── 1.1.1
│   ├── 1.1.2
│   ├── 1.2.0
│   ├── 2.0.0
│   ├── 2.1.0
│   ├── 2.1.1
│   ├── 2.2.0
│   ├── 2.2.1
│   ├── 2.3.0
│   ├── 2.3.1
│   ├── 2.3.2
│   └── dev
```
Test of the pipeline
```bash
# Create output dir
mkdir nf-core-mag-test

# Load the modules
module load bioinfo-tools Nextflow/22.10.1 nf-core-pipelines/latest

# Set the variable for NXF_HOME
export NXF_HOME=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.nextflow

# Let's give it a test
nextflow run $NF_CORE_PIPELINES/mag/2.3.2/workflow -profile test,uppmax --project naiss2024-5-1 --outdir nf-core-mag-test/
```
```
N E X T F L O W ~ version 22.10.2
Launching `/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/main.nf` [exotic_venter] DSL2 - revision: 8f9adbafb4
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/mag v2.3.2
------------------------------------------------------
Core Nextflow options
runName : exotic_venter
containerEngine : singularity
launchDir : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS
workDir : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/work
projectDir : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow
userName : andbou
profile : test,uppmax
configFiles : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/nextflow.config
Input/output options
input : https://raw.githubusercontent.com/nf-core/test-datasets/mag/samplesheets/samplesheet.csv
outdir : nf-core-mag-test/
Reference genome options
igenomes_base : /sw/data/igenomes/
Institutional config options
custom_config_base : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/../configs/
config_profile_name : Test profile
config_profile_description : Minimal test dataset to check pipeline function
config_profile_contact : Phil Ewels (@ewels)
config_profile_url : https://www.uppmax.uu.se/
Max job request options
max_cpus : 2
max_memory : 6.GB
max_time : 6.h
Quality control for short reads options
phix_reference : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_002596845.1_ASM259684v1_genomic.fna.gz
Quality control for long reads options
lambda_reference : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_000840245.1_ViralProj14204_genomic.fna.gz
Taxonomic profiling options
centrifuge_db : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_cf.tar.gz
kraken2_db : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_kraken.tgz
skip_krona : true
gtdb : false
Binning options
skip_concoct : true
min_length_unbinned_contigs: 1
max_unbinned_contigs : 2
Bin quality check options
busco_reference : https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
busco_clean : true
checkm_db : null
gunc_db : null
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/mag for your analysis please cite:
* The pipeline publication
https://doi.org/10.1093/nargab/lqac007
* The pipeline
https://doi.org/10.5281/zenodo.3589527
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/mag/blob/master/CITATIONS.md
------------------------------------------------------
WARN: Found unexpected parameters:
* --save_reference: true
- Ignore this warning: params.schema_ignore_params = "save_reference"
Unable to read script: '/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/./workflows/mag.nf' -- cause: https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
-[nf-core/mag] Pipeline completed successfully, but with errored process(es) -
Completed at: 14-May-2024 12:22:36
Duration : 1h 6m 18s
CPU hours : 1.5 (0.2% failed)
Succeeded : 161
Ignored : 1
Failed : 1
```
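The failing process was the BUSCO lineage download (the `cause:` URL in the error above). A possible workaround, which is essentially what the local `busco_db` path further down amounts to, is to fetch and unpack the lineage manually and point the pipeline at it; a sketch, with the URL taken from the log:

```bash
# Pre-download the BUSCO bacteria lineage so the pipeline does not have to
wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
mkdir -p busco-db
tar -xzf bacteria_odb10.2020-03-06.tar.gz -C busco-db
# Then pass e.g. --busco_db busco-db/bacteria_odb10/ to the pipeline
```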
As we are going to run this for only one sample (CHST), I will manually write these `.csv` files.
Note from https://nf-co.re/mag/3.0.0/docs/usage: since long reads are only used for assembly, any long-read FASTQ files listed in the reads CSV are ignored when an existing assembly is provided. Hence, I will not add long reads to the reads `.csv` file.
samplesheet-reads.csv
```csv
sample,group,short_reads_1,short_reads_2,long_reads
CHST,0,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R1-clean.fq.gz,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R2-clean.fq.gz,
```
samplesheet-assembly.csv
```csv
id,group,assembler,fasta
CHST,0,SPAdesHybrid,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/CHST/contigs.fasta
```
The parameters `.yml` file:
```yaml
input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
cat_db_generate: true
save_cat_db: true
busco_db: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/busco-db/bacteria_odb10/
save_busco_db: true
busco_clean: true
save_checkm_data: true
refine_bins_dastool: true
```
The command I ran:
```bash
nextflow run nf-core/mag -r 3.0.0 -bg -profile uppmax -params-file ../doc/chst-mag-params.yml --project naiss2024-5-1 > chst-bg.log
```
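Because of `-bg`, the run detaches from the terminal, so progress is easiest to follow through the redirected log:

```bash
# Follow the background run's output
tail -f chst-bg.log
```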
As it turned out after multiple runs, `busco` and `cat_db` failed, which caused the whole pipeline to fail. I could not resolve the error after checking the issues and the Slack channel. BUSCO and CAT can be run afterwards on the finished bins; see the sketch below.
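For the record, a standalone BUSCO run on a finished bin might look like this (a sketch assuming the BUSCO v5 CLI and a pre-downloaded lineage; the bin name is taken from the results below):

```bash
# Hypothetical post-hoc BUSCO run on one bin, in genome mode
busco -i SPAdesHybrid-MaxBin2-CHST.001.fa \
      -l bacteria_odb10 \
      -m genome \
      -o busco-CHST-001
```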
Here is the updated `.yaml` that rendered a successful run.
```yaml
input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
binqc_tool: checkm
save_checkm_data: true
refine_bins_dastool: true
```
`DASTool` was able to render three high-quality bins. We see they have high N50 values and good genome sizes, with two having completeness above 90%. However, bins from `CONCOCT` were not handled, as it produces bins in a different format from `MaxBin2` and `MetaBAT2`.
```csv
bin,bin_set,unique_SCGs,redundant_SCGs,SCG_set,size,contigs,N50,bin_score,SCG_completeness,SCG_redundancy
SPAdesHybrid-MaxBin2Refined-CHST.001,SPAdesHybrid-MaxBin2-CHST.tsv,50,0,bacteria,2245518,136,145436,0.980392156862745,98,0
SPAdesHybrid-MetaBAT2Refined-CHST.3,SPAdesHybrid-MetaBAT2-CHST.tsv,46,0,bacteria,1082364,113,16567,0.901960784313725,90,0
SPAdesHybrid-MaxBin2Refined-CHST.003_sub,SPAdesHybrid-MaxBin2-CHST.tsv,32,2,bacteria,2066527,373,15407,0.570343137254902,63,4
```
From the GTDB-Tk taxonomy classification, three bins were classified: one belonging to Erwinia and two to Wolbachia pipientis.
```
SPAdesHybrid-MaxBin2-CHST.001.fa,d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia;s
SPAdesHybrid-MaxBin2-CHST.002.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis
SPAdesHybrid-MetaBAT2-CHST.3.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis
```
In the COGE run, only one bin was found.
```csv
bin,bin_set,unique_SCGs,redundant_SCGs,SCG_set,size,contigs,N50,bin_score,SCG_completeness,SCG_redundancy
SPAdes-MetaBAT2Refined-COGE.3,SPAdes-MetaBAT2-COGE.tsv,50,0,bacteria,1447577,51,72858,0.980392156862745,98,0
```
It was classified as an Erwinia sp.
```
SPAdes-MetaBAT2-COGE.3.fa d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia
```
Next, I will give `anvio` a go to get more control over each step, but it is fun getting some results!
Things to look into:
- `CheckM2` for refinement (better for novel species, uses machine learning); see the sketch below.
- Handling the `CONCOCT` bins.
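A minimal sketch of the CheckM2 idea, assuming the CheckM2 `predict` CLI and that its database has already been fetched with `checkm2 database --download`; the folder name is hypothetical:

```bash
# Hypothetical CheckM2 quality check over a folder of bins
checkm2 predict \
    --input bins/ \
    --extension fa \
    --threads 8 \
    --output-directory checkm2-out
```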
Input specifications
To skip the QC and assembly steps, we can provide the assemblies in a CSV file, together with a CSV file specifying the reads used to generate them. The assembly CSV is passed with `--assembly_input` and the reads CSV with `--input`. The assembly CSV must contain the columns `id,group,assembler,fasta`, with one of these values for assembler: MEGAHIT, SPAdes, or SPAdesHybrid. The reads CSV must contain the columns `sample,group,short_reads_1,short_reads_2,long_reads`. Note that both files must contain the headers.
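Minimal example samplesheets, with hypothetical paths (the CHST files earlier in this issue are real instances):

Example: samplesheet-assembly.csv
```csv
id,group,assembler,fasta
sample1,0,SPAdes,/path/to/sample1/contigs.fasta
```

Example: samplesheet-reads.csv
```csv
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,/path/to/sample1_R1.fq.gz,/path/to/sample1_R2.fq.gz,
```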
Running the pipeline
Example command:
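Patterned on the CHST run above (the project ID, params file, and outdir are site-specific):

```bash
# Run a pinned pipeline version with a parameters file, as in the CHST run
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --project naiss2024-5-1 \
    --outdir results/
```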
Updating the pipeline
Nextflow automatically pulls the pipeline code and caches it locally. To be sure you are running the latest version of the pipeline, regularly update the cached copy:
```bash
nextflow pull nf-core/mag
```
Reproducibility
- Specify the pipeline version using `-r version`. This will be logged in reports, so one can always check which versions of the different tools were used.
- `MetaBAT2` is run by default with a fixed seed set in the pipeline.
- To keep `BUSCO` reproducible, download a lineage beforehand, as lineages are frequently updated and old versions are not always easily accessible.
- For the taxonomic bin classification with CAT, when running the pipeline with `--cat_db_generate`, the parameter `--save_cat_db` can be used to also save the generated database to allow reproducibility in future runs. Note that when specifying a pre-built database with `--cat_db`, the database currently cannot be saved.
- When it comes to visualizing taxonomic data using Krona, you have the option to provide a taxonomy file, such as taxonomy.tab, using the `--krona_db` parameter (see the sketch after this list). If you don't supply a taxonomy file, Krona is designed to automatically download the required taxonomy data for visualization.
- The taxonomic classification of bins with GTDB-Tk is not guaranteed to be reproducible, since the placement of bins in the reference tree is non-deterministic. However, the authors of the GTDB-Tk article examined the reproducibility on a set of 100 genomes across 50 trials and did not observe any difference (see https://doi.org/10.1093/bioinformatics/btz848).
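For the Krona point, passing a pre-downloaded taxonomy file would look something like this (a sketch; `--krona_db` as named in the note above, the taxonomy path hypothetical, other arguments as in the example command earlier):

```bash
# Supply a local Krona taxonomy instead of letting Krona download one
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --krona_db /path/to/taxonomy.tab
```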
Note on bin refinement
`DAS Tool` may not work if not enough single-copy genes are recovered. One can change the scoring threshold using `--refine_bins_dastool_threshold`, which will modify the scoring threshold defined in the DAS Tool publication.
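For instance, relaxing the threshold while keeping refinement enabled (parameter names as used in the params file above; the value 0.4 is purely illustrative):

```bash
# Refine bins with DAS Tool using a lower, illustrative score threshold
nextflow run nf-core/mag -r 3.0.0 \
    -profile uppmax \
    -params-file params.yml \
    --refine_bins_dastool \
    --refine_bins_dastool_threshold 0.4
```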