nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Genome binning is performed on uncorrected contigs when selecting the ancient DNA subworkflow #449

Closed alexhbnr closed 1 year ago

alexhbnr commented 1 year ago

Description of the bug

When selecting the ancient DNA sub-workflow using --ancient_dna, a correction of the consensus sequence reported by the assembler is performed. This step should remove artefacts that were wrongly called by the assembler due to the presence of ancient DNA damage.

However, for genome binning, nf-core/mag doesn't select these corrected contigs for the binning but uses the non-corrected contigs instead. This defeats the purpose of enabling the --ancient_dna sub-workflow.

Command used and terminal output

No response

Relevant files

No response

jfy133 commented 1 year ago

@alexhbnr could you provide more information? As far as I can see, what you want is already implemented in dev: the BCFtools consensus output (presumably with the corrected bases) is passed to binning via the 'contigs_recalled' channel:

[two images]

Can you provide an example somehow (or a reprex) where the wrong contigs were used downstream?

alexhbnr commented 1 year ago

If it's already in the dev branch and seems to work, then that's what I want. In release v2.3.0, which is the version I used, the contig sequences found in the final MAGs were identical to the uncorrected contigs. That's why I raised this issue.

jfy133 commented 1 year ago

That's concerning. I'm pretty sure that when I looked earlier, the code looked identical. Which MAGs were you looking at (presumably in the results directory)?

alexhbnr commented 1 year ago

Here is the command that I used to start the pipeline:

nextflow run nf-core/mag -r 2.3.0 \
            -profile eva,archgen \
            --input 04-analysis/Zape2/nfcore_mag_samplesheet.csv \
            --outdir 04-analysis/Zape2/assembly \
            --skip_clipping \
            --skip_prodigal \
            --binning_map_mode own \
            --min_contig_size 1000 \
            --gtdb false \
            --binqc_tool checkm \
            --save_checkm_data \
            --refine_bins_dastool \
            --postbinning_input both \
            --run_gunc \
            --gunc_save_db \
            --ancient_dna

I then did a simple pairwise comparison of the contig sequences found in a MAG in the folder GenomeBinning/DASTool/bins against both the original SPAdes sequences (Assembly/SPAdes/SPAdes-Zape2_MinE2X_scaffolds.fasta.gz) and the consensus sequences returned by the ancient DNA workflow (Ancient_DNA/variant_calling/consensus/Zape2_MinE2X.fa). diff identifies mismatches when comparing the MAG contigs to the consensus sequences, but not when comparing them to the original sequences returned by SPAdes.

So I guess in the version I was running, the contigs used in DASTool aren't the ones returned in the consensus folder, which I assume are the corrected ones.
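
For anyone who wants to repeat this check, here is a minimal sketch of the comparison described above. It is not the exact procedure used in this thread (which relied on diff), and it makes two assumptions: contig names are identical across the bin, the SPAdes scaffolds, and the consensus FASTA, and case differences in the sequences can be ignored. The function names `read_fasta` and `compare_bin_to_reference` are hypothetical helpers, not part of nf-core/mag.

```python
import gzip


def read_fasta(path):
    """Return {contig_name: sequence} for a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    records, name, chunks = {}, None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                # Assumption: the first word of the header is the contig name.
                header = line[1:].split()
                name, chunks = (header[0] if header else ""), []
            else:
                # Assumption: case (e.g. soft-masking) is not meaningful here.
                chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def compare_bin_to_reference(bin_fasta, reference_fasta):
    """Sort the contigs of one bin into identical / different / missing
    relative to a reference FASTA, matching contigs by name."""
    bin_contigs = read_fasta(bin_fasta)
    reference = read_fasta(reference_fasta)
    result = {"identical": [], "different": [], "missing": []}
    for name, seq in bin_contigs.items():
        if name not in reference:
            result["missing"].append(name)
        elif reference[name] == seq:
            result["identical"].append(name)
        else:
            result["different"].append(name)
    return result
```

Running `compare_bin_to_reference` once with a bin from GenomeBinning/DASTool/bins against the SPAdes scaffolds and once against the consensus FASTA gives two summaries. If binning used the corrected contigs, every bin contig should be listed as identical to the consensus file, and some should differ from the SPAdes file wherever a base was corrected.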

@maxibor will check my execution trace to see what's going on here.