ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

nf-core/mag #38

Open ndreey opened 6 months ago

ndreey commented 6 months ago

Input specifications

To skip the QC and assembly steps, we can provide the assemblies together with the reads that were used to generate them, each in its own CSV file. The assembly CSV is passed with --assembly_input and the reads CSV with --input. The assembly CSV must have the columns id,group,assembler,fasta, where assembler is one of MEGAHIT, SPAdes, or SPAdesHybrid. The reads CSV must have the columns sample,group,short_reads_1,short_reads_2,long_reads. Note that both files must contain the header row.

Example: samplesheet-assembly.csv

id,group,assembler,fasta
sample1,0,MEGAHIT,MEGAHIT-sample1.contigs.fa.gz
sample1,0,SPAdes,SPAdes-sample1.contigs.fasta.gz
sample2,0,MEGAHIT,MEGAHIT-sample2.contigs.fa.gz
sample2,0,SPAdes,SPAdes-sample2.contigs.fasta.gz
sample3,1,MEGAHIT,MEGAHIT-sample3.contigs.fa.gz
sample3,1,SPAdes,SPAdes-sample3.contigs.fasta.gz

Example: samplesheet-reads.csv

sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,
sample2,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,
sample3,1,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,

Running the pipeline

Example command:

nextflow run nf-core/mag --input samplesheet.csv --outdir <OUTDIR> -profile docker
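The same command can be extended to skip QC and assembly by pointing at the two sample sheets described above. A minimal sketch, assuming the example samplesheet-reads.csv and samplesheet-assembly.csv from this comment:

nextflow run nf-core/mag --input samplesheet-reads.csv --assembly_input samplesheet-assembly.csv --outdir <OUTDIR> -profile docker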

Updating the pipeline

Nextflow automatically pulls the pipeline code and stores it as a cached version. To make sure you are running the latest version of the pipeline, regularly update the cached copy with: nextflow pull nf-core/mag
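For example, to refresh the cached copy and see which revisions are available locally (nextflow info is a general Nextflow command, not specific to mag):

# Update the cached pipeline code
nextflow pull nf-core/mag

# List the cached project and its available revisions
nextflow info nf-core/mag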

Reproducibility

Specify the pipeline version using -r <version>. The version is logged in the reports, so one can always check which versions of the different tools were used.

MetaBAT2 is run by default with a fixed seed set in the pipeline.

To keep BUSCO reproducible, download a lineage beforehand as they are frequently updated and old versions are not always easily accessible.
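A minimal sketch of pre-downloading a lineage, using the bacteria_odb10 archive that the test profile points at (URL visible in the test run below); the target directory is an arbitrary choice:

# Fetch and unpack a fixed BUSCO lineage ahead of time
mkdir -p databases/busco-db
wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
tar -xzf bacteria_odb10.2020-03-06.tar.gz -C databases/busco-db

# The unpacked lineage directory can then be passed to the pipeline,
# e.g. as busco_db in a params file (as done later in this thread)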

For the taxonomic bin classification with CAT, when running the pipeline with --cat_db_generate, the parameter --save_cat_db can be used to also save the generated database and so allow reproducibility in future runs. Note that when specifying a pre-built database with --cat_db, the database currently cannot be saved.
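For example, to generate the CAT database once and keep it for later runs:

nextflow run nf-core/mag --input samplesheet.csv --outdir <OUTDIR> -profile docker --cat_db_generate --save_cat_db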

For visualizing taxonomic data with Krona, you can provide a taxonomy file, such as taxonomy.tab, via the --krona_db parameter. If you do not supply one, Krona automatically downloads the taxonomy data it needs for visualization.

The taxonomic classification of bins with GTDB-Tk is not guaranteed to be reproducible, since the placement of bins in the reference tree is non-deterministic. However, the authors of the GTDB-Tk article examined the reproducibility on a set of 100 genomes across 50 trials and did not observe any difference (see https://doi.org/10.1093/bioinformatics/btz848).

Note on bin refinement

DAS Tool may fail when too few single-copy genes are recovered. The scoring threshold can be changed with --refine_bins_dastool_threshold, which overrides the threshold defined in the DAS Tool publication.
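A sketch of loosening that threshold (the value 0.4 is only an illustration; choose it to suit the data):

nextflow run nf-core/mag --input samplesheet.csv --outdir <OUTDIR> -profile docker --refine_bins_dastool --refine_bins_dastool_threshold 0.4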

ndreey commented 6 months ago

Setting everything up

As usual in bioinformatics, programs are not ready to go from the get-go.

UPPMAX no longer updates the nf-core module. Hence, the nf-core/mag pipeline is at 2.3.2 while the latest is 3.0.0, and nf-core tools is at 2.6 while the latest is 2.14.1. Nextflow is also outdated, as I got this error: Nextflow version 22.10.2 does not match workflow required version: >=23.04.0. I noticed this when I tried running the test profile and ran into errors that were fixed in later versions.

To fix this, I created an nf-core mamba environment and installed Nextflow and nf-core tools.

mamba create -n nf-core python=3.12 nf-core nextflow
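A quick sanity check that the environment picked up recent versions (both are standard commands):

mamba activate nf-core
nextflow -version
nf-core --version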

Now, with the .bashrc edits (#37), I activated the environment and ran the test profile together with the uppmax profile; SUCCESS!

mamba activate nf-core
mkdir nf-core-mag-test
cd nf-core-mag-test

# Run the workflow
nextflow run nf-core/mag -profile test,uppmax --project naiss2024-5-1 --outdir .

It is running now and creating slurm jobs.
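To keep an eye on it (standard SLURM and Nextflow commands, nothing mag-specific):

# Jobs submitted by the pipeline show up in the queue
squeue -u $USER

# Follow the Nextflow log written in the launch directory
tail -f .nextflow.log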

Errors from UPPMAX modules

This is the code I ran to get the errors mentioned above.

module load nf-core/latest nf-core-pipelines/latest

# Path to where all pipelines are stored
echo $NF_CORE_PIPELINES
/sw/bioinfo/nf-core-pipelines/latest/rackham

tree -L 2 $NF_CORE_PIPELINES -I 'singularity_cache_dir'
├── mag
│   ├── 1.0.0
│   ├── 1.1.0
│   ├── 1.1.1
│   ├── 1.1.2
│   ├── 1.2.0
│   ├── 2.0.0
│   ├── 2.1.0
│   ├── 2.1.1
│   ├── 2.2.0
│   ├── 2.2.1
│   ├── 2.3.0
│   ├── 2.3.1
│   ├── 2.3.2
│   └── dev

Test of the pipeline

# Create output dir
mkdir nf-core-mag-test

# Load the modules
module load bioinfo-tools Nextflow/22.10.1 nf-core-pipelines/latest

# Set the variable for NXF_HOME
export NXF_HOME=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.nextflow

# Lets give it a test
nextflow run $NF_CORE_PIPELINES/mag/2.3.2/workflow -profile test,uppmax --project naiss2024-5-1 --outdir nf-core-mag-test/
N E X T F L O W  ~  version 22.10.2
Launching `/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/main.nf` [exotic_venter] DSL2 - revision: 8f9adbafb4

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/mag v2.3.2
------------------------------------------------------
Core Nextflow options
  runName                    : exotic_venter
  containerEngine            : singularity
  launchDir                  : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS
  workDir                    : /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/work
  projectDir                 : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow
  userName                   : andbou
  profile                    : test,uppmax
  configFiles                : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/nextflow.config

Input/output options
  input                      : https://raw.githubusercontent.com/nf-core/test-datasets/mag/samplesheets/samplesheet.csv
  outdir                     : nf-core-mag-test/

Reference genome options
  igenomes_base              : /sw/data/igenomes/

Institutional config options
  custom_config_base         : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/../configs/
  config_profile_name        : Test profile
  config_profile_description : Minimal test dataset to check pipeline function
  config_profile_contact     : Phil Ewels (@ewels)
  config_profile_url         : https://www.uppmax.uu.se/

Max job request options
  max_cpus                   : 2
  max_memory                 : 6.GB
  max_time                   : 6.h

Quality control for short reads options
  phix_reference             : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_002596845.1_ASM259684v1_genomic.fna.gz

Quality control for long reads options
  lambda_reference           : /sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/assets/data/GCA_000840245.1_ViralProj14204_genomic.fna.gz

Taxonomic profiling options
  centrifuge_db              : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_cf.tar.gz
  kraken2_db                 : https://raw.githubusercontent.com/nf-core/test-datasets/mag/test_data/minigut_kraken.tgz
  skip_krona                 : true
  gtdb                       : false

Binning options
  skip_concoct               : true
  min_length_unbinned_contigs: 1
  max_unbinned_contigs       : 2

Bin quality check options
  busco_reference            : https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
  busco_clean                : true
  checkm_db                  : null
  gunc_db                    : null

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/mag for your analysis please cite:

* The pipeline publication
  https://doi.org/10.1093/nargab/lqac007

* The pipeline
  https://doi.org/10.5281/zenodo.3589527

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/mag/blob/master/CITATIONS.md
------------------------------------------------------

WARN: Found unexpected parameters:
* --save_reference: true
- Ignore this warning: params.schema_ignore_params = "save_reference" 

Unable to read script: '/sw/bioinfo/nf-core-pipelines/latest/rackham/mag/2.3.2/workflow/./workflows/mag.nf' -- cause: https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
ndreey commented 6 months ago

The test run was a SUCCESS!

-[nf-core/mag] Pipeline completed successfully, but with errored process(es) -
Completed at: 14-May-2024 12:22:36
Duration    : 1h 6m 18s
CPU hours   : 1.5 (0.2% failed)
Succeeded   : 161
Ignored     : 1
Failed      : 1
ndreey commented 6 months ago

Running the program for CHST hybrid assembly

As we are going to run this for only one sample (CHST), I will write these .csv files manually.

Note from https://nf-co.re/mag/3.0.0/docs/usage: as long reads are only used for assembly, any long-read FASTQ files listed in the reads CSV are ignored. Hence, I will not add long reads to the reads .csv.

samplesheet-reads.csv

sample,group,short_reads_1,short_reads_2,long_reads
CHST,0,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R1-clean.fq.gz,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R2-clean.fq.gz,

samplesheet-assembly.csv

id,group,assembler,fasta
CHST,0,SPAdesHybrid,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/CHST/contigs.fasta

The parameters .yml file

input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
cat_db_generate: true
save_cat_db: true
busco_db: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/busco-db/bacteria_odb10/
save_busco_db: true
busco_clean: true
save_checkm_data: true
refine_bins_dastool: true

The command I ran

nextflow run nf-core/mag -r 3.0.0 -bg -profile uppmax -params-file ../doc/chst-mag-params.yml --project naiss2024-5-1 > chst-bg.log
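Since the run is backgrounded with -bg and its output redirected, progress can be followed from the logs (standard commands):

# Follow the redirected pipeline output
tail -f chst-bg.log

# List recent Nextflow runs and their status
nextflow log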
ndreey commented 6 months ago

As it turned out after multiple runs, the BUSCO and CAT database steps failed, which caused the pipeline to fail. I could not resolve the errors after checking the GitHub issues and the Slack channel. BUSCO and CAT can be run separately afterwards.
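A minimal sketch of running BUSCO on a resulting bin afterwards, assuming the pre-downloaded bacteria_odb10 lineage from earlier (the bin filename is taken from the results below):

# Run BUSCO in genome mode on a single bin
busco -i SPAdesHybrid-MaxBin2-CHST.001.fa -l bacteria_odb10 -m genome -o busco-CHST-bin001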

Here is the updated .yml that yielded a successful run.

input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/nf-core-mag/CHST
email: andbou95@gmail.com
multiqc_title: CHST
binqc_tool: checkm
save_checkm_data: true
refine_bins_dastool: true

Initial results

DAS Tool was able to produce three high-quality bins. They show high N50 values and good genome sizes, with two having completeness above 90%. However, bins from CONCOCT were not handled, as CONCOCT produces bins in a different format from MaxBin2 and MetaBAT2.

bin                                      ,bin_set                        ,unique_SCGs ,redundant_SCGs ,SCG_set  ,size    ,contigs ,N50    ,bin_score         ,SCG_completeness ,SCG_redundancy
SPAdesHybrid-MaxBin2Refined-CHST.001     ,SPAdesHybrid-MaxBin2-CHST.tsv  ,         50 ,             0 ,bacteria ,2245518 ,    136 ,145436 ,0.980392156862745 ,              98 ,             0
SPAdesHybrid-MetaBAT2Refined-CHST.3      ,SPAdesHybrid-MetaBAT2-CHST.tsv ,         46 ,             0 ,bacteria ,1082364 ,    113 , 16567 ,0.901960784313725 ,              90 ,             0
SPAdesHybrid-MaxBin2Refined-CHST.003_sub ,SPAdesHybrid-MaxBin2-CHST.tsv  ,         32 ,             2 ,bacteria ,2066527 ,    373 , 15407 ,0.570343137254902 ,              63 ,             4

From the GTDB-Tk taxonomic classification, three bins were classified: one belonging to Erwinia and two to Wolbachia pipientis.

SPAdesHybrid-MaxBin2-CHST.001.fa,d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia;s
SPAdesHybrid-MaxBin2-CHST.002.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis
SPAdesHybrid-MetaBAT2-CHST.3.fa,d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Wolbachia;s__Wolbachia pipientis

In the COGE run, only one bin was recovered.

bin,bin_set,unique_SCGs,redundant_SCGs,SCG_set,size,contigs,N50,bin_score,SCG_completeness,SCG_redundancy
SPAdes-MetaBAT2Refined-COGE.3,SPAdes-MetaBAT2-COGE.tsv,50,0,bacteria,1447577,51,72858,0.980392156862745,98,0

It was classified as an Erwinia sp.

SPAdes-MetaBAT2-COGE.3.fa       d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Erwinia

Next, I will give anvi'o a go to get more control over each step, but it is fun getting some results!