Help needed on improving performance of skesa assemblies in some downstream applications

andersgs commented 5 years ago

Hello.

Skesa is awesome assembler. But, I have stumbled on an issue and I was hoping you may be able to help. I detail what I have done so far and where I think the problem is below. Please let me know if you need anything else.

The tool sistr can be found here: https://github.com/peterk87/sistr_cmd

Background

I am performing limit of detection experiments for detecting Salmonella serotype using the tool sistr. In this experiment, I randomly downsample reads to various depths (e.g., 10, 20, etc), assemble them, and run the assembly through sistr. I am experimenting with spades.py 3.12.0, shovill 1.0.1 (using spades) and skesa 2.2.0. Each combination of sample/depth was repeated 10X.

sistr attempts three different methods of serotype inference: (1) by matching the assembly to a DB of antigen genes; (2) Based on mash against a curated set of genomes; and (3) using their own 330 cgMLST.

In our pipeline, a successful inference is attributed to assemblies that result in concordant calls across cgMLST and antigen methods. We also require the correct inference of O antigen, the H1 and/or H2 (Enteritidis only has H1 – Salmonella antigenic formula wiki), and the Serogroup.

We use NextSeq 500 data.

Commands

Skesa assembly:

skesa --gz --fastq R1.fastq.gz,R2.fastq.gz --use_paired_ends --vector_percentage 1 > skesa.fa

Spades assembly:

spades.py -o spades -1 R1.fastq.gz -2 R2.fastq.gz

Shovill assembly:

shovill -o shovill --R1 R1.fastq.gz --R2 R2.fastq.gz

Results

At the moment, I have run the pipeline with two isolates from SRA (both serovar Enteritidis): SRR7284860, SRR7284877. The preliminary results suggest that skesa assemblies performs significantly worse than spades or shovill:

prob

Digging a bit deeper, this results is in part due to isolate SRR7284860 performing poorly at practically every depth (although the other isolate only really started to perform relatively well at depths 70 and larger):

raw_dat

When we examine why skesa assemblies are failing at such a high rate, it seems the Serogroup inference is missing. Serogroup is determined by wzx or wzyThis isolate seems to have relatively uneven and poor coverage over one of the crucial genes determining serogroup (wzx --- figure below is mapping of the whole read set to the gene):

read_depth

Which, I guess, once we start to subsample the reads causes issues.

To investigate a little further, I used minimap2 to bait reads mapping to the wzx gene in isolate SRR7284860, and used samtools to create a pair of FASTQ files (only kept reads mapping to the region and that were properly paired). This is one of the central genes associated with determination of serogroups. When I try to assemble these reads with spades.py I get 2 contigs that roughly span the whole gene (with a break around that region of zero coverage). With skesa, I also get two contigs, but they are far far smaller. I am attaching the contigs generated from the assembly of the reads in the gene, as well as the reads mapping to the gene.

This allele of the gene has about ~30% GC content, and that seems to be the average for this gene.

Attached files

r1.fq, f2.fq --- paired end reads obtained by baiting the whole read set of SRR7284860 for the wzx gene
skesa.fa --- assembly of the the above reads
spades.fa --- assembly of the above reads
seq.fasta --- reference sequence of the wzx gene

SRR7284860_wzx.zip

souvorov commented 5 years ago

SKESA was designed to avoid questionable connections and extensions. On the other hand, SPAdes indeed uses every read to extend the assembly. This may result in less than perfect assembly if this last read is bogus. You can relax SKESA slightly using --allow_snps options (see issue #5 for explanation about additional output). Even with this option SKESA will not assemble the whole gene because of the coverage gap.

I understand that you do the assembly just to find some specific genes. For this purpose we have a better tool which we call guided assembler. The idea behind it is to align target sequences directly to the de Bruijn graph and find all paths with reasonably good alignments. If you think that this application, which is not public yet, might be useful for your project send me an email to souvorov@ncbi.nlm.nih.gov for further discussion.

andersgs commented 5 years ago

Thank you @souvorov. Appreciate the answer. I'll give --allow_snps a go and will be sending you an email.

tseemann commented 4 years ago

@souvorov is the "targeted gene assembler" tool ready for public testing yet?

souvorov commented 4 years ago

@tseemann we are getting close - I'd say one more month. The new assembler(s) will be able to use both nucleotide and protein references.

ncbi / SKESA