tseemann / nullarbor

:floppy_disk: :page_with_curl: "Reads to report" for public health and clinical microbiology
GNU General Public License v2.0
136 stars 37 forks source link

Number of contigs - filtering by default? #235

Closed chrstraub closed 5 years ago

chrstraub commented 5 years ago

Hi again,

I've got a question regarding the discrepancy in number of contigs in contigs.fa and contigs.gff.

I did a comparison of SKESA vs SPAdes assembler and in terms of N50 and number of contigs SPAdes outperforms SKESA, according to the final report that is created.

SKESA

Isolate | Contigs | bp | ok | Ns | gaps | min | avg | max | N50 | CDS
18AR1237 | 117 | 2114148 | 2114148 | 0 | 0 | 504 | 18069 | 152163 | 46626 | 2079
18GQA01 | 117 | 2120247 | 2120247 | 0 | 0 | 520 | 18121 | 108270 | 38933 | 2078

SPAdes

Isolate | Contigs | bp | ok | Ns | gaps | min | avg | max | N50 | CDS
18AR1237 | 85 | 2135149 | 2134671 | 478 | 0 | 543 | 25119 | 207692 | 60466 | 2095
18GQA01 | 85 | 2138493 | 2137714 | 779 | 0 | 508 | 25158 | 207737 | 52612 | 2100

But when I have a look at the raw assembly files, i.e. contigs.fa - they have a much higher number of contigs. i.e. for SPAdes:

Sample | N contigs contigs.fa | N contigs contigs.gff | 
18AR1237_S30 |  317 | 85 
18GQA01_S6 | 339 | 85 

So my questions are:

tseemann commented 5 years ago

The --asembler option controls which plugin script is run. The plugins are here:

% ls nullarbor/plugins/assembler/*.sh
megahit.sh  shovill.sh  skesa_fast.sh  skesa.sh  spades.sh

Let's look at them:

% cat skesa.sh

skesa --fastq "$read1,$read2" \
      --cores "$cpus" \
      --vector_percent 1.0 \
      $opts \
      --contigs_out "$outdir/contigs.fa"
% cat spades.sh

WORKDIR=$(mktemp -d)
OUTDIR="$WORKDIR/spades"
spades.py -m 32 -o "$OUTDIR" \
  --pe1-1 "$read1" --pe1-2 "$read2" \
  -t "$cpus" \--careful $opts
cp -v -f "$OUTDIR/scaffolds.fasta" "$outdir/contigs.fa"
cp -v -f "$OUTDIR/assembly_graph_with_scaffolds.gfa" "$outdir/contigs.gfa"
cp -v -f "$OUTDIR/spades.log" "$outdir/contigs.log"
rm -frv "$WORKDIR"

They are using default parameters mostly BUT the prokka part sets the min contig length. But that is set by --minctllen in the nullarbor.pl

%/contigs.gff: %/contigs.fa
  gffout="$(@)" gbkout="$(@D)/contigs.gbk" contigs="$(<)" locustag="$(@D)" gcode="$(GCODE)" minlen="$(MIN_CTG_LEN)" $(ANNOTATOR)

You can control this via make MIN_CTG_LEN=100 when you run the Makefile. The default is 500.

This is also used in the assembly stats:

denovo.tab : $(CONTIGS)
  fa --minsize $(MIN_CTG_LEN) -e -t $^ > $@

The reason for this is that you can't compare assemblies that have a different minimum size. People using Nullarbor often re-use older folders over time, so this puts everyone on an even footing. Given readlengths/pair-span is ~500bp then it makes sense to exclude smaller.

YES, spades will be better than skesa. But SKESA is MUCH faster and is conservative (rare to get a mis-assembly) and is often good enough for public health typing.

tseemann commented 5 years ago

TLDR;

  1. contigs.fa is spades' scaffolds.fa, it's renamed to be consistent
  2. MIN_CTG_LEN=500 and use make MIN_CTG_LEN=100 to change it at run time or use --minctg LEN_BP Minimum contig length for Prokka and Roary
  3. prokka includes all input contigs in its output, but --minctglen is set
chrstraub commented 5 years ago

Thanks Torsten, for clarifying this!