marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Assembling tiling amplicon sequencing nanopore data #1674

Closed MaestSi closed 4 years ago

MaestSi commented 4 years ago

Dear Canu developers, I would like to know whether Canu is suitable for assembling tiling amplicon sequencing data, which are characterized by ups and downs in coverage depth due to differences in PCR primer efficiency and to the overlap between terminal portions of adjacent amplicons. For example, in the case of covid19 assembly, the approach suggested by the ARTIC network is to align to a reference, call variants, and produce a new consensus sequence, but I was curious to try de novo assembly with Canu too. In this case the genome size is 30 kbp and the amplicons (protocol V1) are about 400 bp long, with about 40 bp at each end overlapping another amplicon, excluding the PCR primers, which I previously trimmed with Porechop.

This is the coverage of the reads mapped to the reference, which is very uneven (scale is 0-100k): ncov-2019_reads_coverage

Changing the scale to 0-500, you can see a few portions have very low coverage: ncov-2019_reads_0_500_coverage

I then used this command to assemble the reads with Canu 2.0:

canu -d canu_output -p SP1_covid19 genomeSize=30k -nanopore SP1_trimmed.fastq minReadLength=200 minOverlapLength=10

These are the statistics of the assembly I got:

stats for SP1_covid19.contigs.fasta
sum = 34378, n = 54, ave = 636.63, largest = 3799

Coverage of the contigs reflects read coverage depth: ncov-2019_canu_contigs_coverage

Do you think there are any parameters I should tune to do a better job? This is just a test; I know there is no point in assembling this data de novo, but I am in the process of choosing the best assembly strategy for another project where the overlaps are longer and the coverage is hopefully more even. Thanks, Simone

skoren commented 4 years ago

You don't show where the contigs align to the reference; I'd guess they're breaking near the low-coverage regions.

There are several issues with what you're trying to do.

I think, in addition to the options you already have, you want to correct all the data (corOutCoverage=10000 corMinCoverage=0), use a small k-mer (MhapMerSize=11 OvlMerSize=11), and skip trimming (run canu -correct and then canu -assemble, inputting the corrected reads as -nanopore-corrected). I'm not sure how much this will slow things down. The benefit of mapping first is that you can downsample the data to even out the coverage.

MaestSi commented 4 years ago

Hi Sergey, thank you for the advice. I tried running Canu with the suggested options:

canu -correct -d canu_correct -p SP1_covid19 genomeSize=30k -nanopore SP1_trimmed.fastq minReadLength=200 minOverlapLength=10 corOutCoverage=10000 corMinCoverage=0 MhapMerSize=11 OvlMerSize=11
canu -assemble -d canu_assemble -p SP1_covid19 genomeSize=30k -nanopore-corrected canu_correct/SP1_covid19.correctedReads.fasta.gz minReadLength=200 minOverlapLength=10 corOutCoverage=10000 corMinCoverage=0 MhapMerSize=11 OvlMerSize=11

I then mapped to the reference and loaded into IGV the contigs obtained with this method (Canu correct + assemble), the contigs obtained with the previous default parameters (Canu default), and the trimmed reads (Reads): ncov-2019_tests

I would say there is an improvement in the covered portion of the genome, although a small one. Would it be possible now to assemble the contigs, or is this the best I can obtain, in your opinion? I tried with this command:

canu -assemble -d canu_output_assembled_contigs -p SP1_covid19 genomeSize=30k -nanopore-corrected canu_assemble/SP1_covid19.contigs.fasta minReadLength=200 minOverlapLength=10 corOutCoverage=1 corMinCoverage=0 stopOnLowCoverage=0 MhapMerSize=11 OvlMerSize=11 minInputCoverage=1

but it stopped with errors. Thanks in advance, Simone

skoren commented 4 years ago

I'm generally not a fan of trying to assemble contigs, but in this case that may be a reasonable approach since the contigs really represent the amplicons. Your command seems reasonable; what error did you get from the run? Are you able to share the data (either reads or contigs)?

MaestSi commented 4 years ago

Hi, attached you can find Canu's log: canu_error.txt I sent you an e-mail where you can find an ftp link for downloading the reads. Thanks, Simone

MaestSi commented 4 years ago

Hi, is there any update related to the assembly of contigs? Thanks, Simone

skoren commented 4 years ago

Sorry, I have not had a chance to investigate this in detail. I have your reads and will try to run some assemblies over the weekend.

gringer commented 4 years ago

Assuming these are all SARS-CoV-2 reads, alignment to the genome should be very good (even with substantial errors). I would recommend mapping and normalising to a more even depth prior to assembly. I've written a script to help do that; it takes a SAM file / pipe as input and outputs either a SAM (default) or FASTQ file, with reads excluded where their coverage exceeds a target threshold (default 100X). The general approach would be as follows:

$ minimap2 -ax map_ont Wuhan-Hu-1.fa reads.fq.gz | samtools sort > reads_vs_ref.bam
$ samtools view -h reads_vs_ref.bam | samNormalise.pl -c 100 -f fastq | gzip > reads_100X.fq.gz
$ canu <other_options> genomeSize=30k -nanopore-raw reads_100X.fq.gz
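The coverage-capping idea behind the normalisation step can be sketched in a few lines of Python. This is a simplified stand-in for the logic of a script like samNormalise.pl, not its actual implementation; it assumes alignments sorted by start position and represents each read as a made-up (name, start, end) tuple rather than a SAM record:

```python
def cap_coverage(reads, genome_len, target=100):
    """Greedily keep reads until per-base coverage reaches `target`.

    `reads` is an iterable of (name, start, end) alignment intervals,
    sorted by start position, as they would arrive from a sorted SAM.
    Returns the names of the reads to keep.
    """
    coverage = [0] * genome_len
    kept = []
    for name, start, end in reads:
        # Skip the read only if every base it covers is already at the cap,
        # so low-coverage regions still receive new reads.
        if min(coverage[start:end]) >= target:
            continue
        kept.append(name)
        for i in range(start, end):
            coverage[i] += 1
    return kept

# Toy example: with a cap of 2, the third identical amplicon is dropped,
# but a read extending into uncovered territory is kept.
reads = [("r1", 0, 10), ("r2", 0, 10), ("r3", 0, 10), ("r4", 5, 15)]
print(cap_coverage(reads, 20, target=2))  # → ['r1', 'r2', 'r4']
```

The greedy skip rule is what flattens the amplicon coverage spikes while leaving the low-coverage amplicons untouched.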

skoren commented 4 years ago

First, the reason you likely had the error assembling contigs is that you ended up with nothing assembled. I think the default error rate and contig filter aren't applicable to your case. You want to add contigFilter="2 0 1.0 0.5 0" (see https://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why). However, I wasn't able to reproduce it when I ran locally because I had different initial contigs. In addition to the above, you may also want to increase the error rate.

I ran the following commands (note the first one will keep all data, by default a random 200x subset is used which makes it much slower to run but the subset might lose low-coverage regions):

canu -p asm -d ncov -correct genomeSize=30k 'corOutCoverage=50000' 'corMinCoverage=0' 'minOverlapLength=20' 'mhapmerSize=10' 'ovlMerSize=10' 'minReadLength=100' 'corMhapSensitivity=normal' 'maxInputCoverage=20000' -nanopore-raw <input.fq>

canu -assemble -p asm -d ncov_asm genomeSize=30k -nanopore-corrected ncov/asm.correctedReads.fasta.gz

canu -assemble -d contigs_test genomeSize=30k -nanopore-corrected ncov_asm/asm.contigs.fasta minReadLength=200 minOverlapLength=10 corOutCoverage=1 corMinCoverage=0 stopOnLowCoverage=0 MhapMerSize=11 OvlMerSize=11 minInputCoverage=1 

My assembly ended up with about 25 contigs. When I look at the mappings of the initial contigs to the reference, this seems to be about as good as you'd expect: there are several regions with no/low contig coverage. Looking at the initial corrected reads, they do look high-quality (over 99% identity) and cover most of the genome, with the exception of a few 0-coverage regions. So I think that to improve the assembly you'd want to focus on the corrected reads, not on trying to assemble contigs. For example, I was able to get a better assembly from the initial corrected reads by running:

canu -p 'asm' -d asm -pacbio-corrected ncov/asm.correctedReads.fasta.gz  'minOverlapLength=50' 'minReadLength=300' 'ovlMerSize=11' 'genomeSize=30k' 'batMemory=10' 'maxInputCoverage=5000' 'correctedErrorRate=0.02' batOptions=

This only uses the best-quality reads and ends up around the expected genome size. Overall, given the data type, I definitely think you don't want to just assemble all the data. Like @gringer said, I think you should even out the coverage to get a better and faster assembly (I used his strategy on the corrected reads I had and got the best assembly, with 20 contigs). You could also try taking the reads that don't map to the reference and assembling/correcting those separately, to make sure nothing was missed by the reference mapping. However, I wouldn't expect much from that since these are all amplicons anyway.
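Extracting the unmapped reads for a separate assembly could look something like the following sketch (filenames are hypothetical; it reuses the reads_vs_ref.bam from the mapping step above, and `-f 4` selects records with the SAM "unmapped" flag set):

```shell
# Pull unmapped reads (FLAG bit 0x4) out of the alignment as FASTQ.
samtools fastq -f 4 reads_vs_ref.bam | gzip > unmapped.fq.gz

# Assemble them on their own to check for anything the reference mapping missed.
canu -p unmapped -d unmapped_asm genomeSize=30k -nanopore-raw unmapped.fq.gz
```

With amplicon data the unmapped set is usually small, so this run should be quick either way.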

MaestSi commented 4 years ago

Thank you @skoren and @gringer ! Very useful advice! Best, Simone