lemene / PECAT

PECAT, a phased error correction and assembly tool
BSD 2-Clause "Simplified" License

Configuration file rule-of-thumb #26

Open jp-jong opened 4 months ago

jp-jong commented 4 months ago

Hi!

I'm just wondering if there is a manual, documentation, or a rule of thumb that can help us set the configuration when using PECAT. Recently, we used PECAT to correct ONT reads from a bacterium with an estimated genome size of 5.5 Mb. I don't know whether my configuration is correct, but I've attached my correction configuration below.

```
project=smarcescens
reads=smarcescens_simplex.filtered.fastq
genome_size=5500000
threads=4
cleanup=1
grid=local

prep_min_length=3000
prep_output_coverage=80

corr_iterate_number=1
corr_block_size=4000000000
corr_filter_options=--filter0=l=5000:al=2500:alr=0.5:aal=5000:oh=1000:ohr=0.1
corr_correct_options=--score=weight:lc=10 --aligner edlib --filter1 oh=1000:ohr=0.01
corr_rd2rd_options=-x ava-ont
corr_output_coverage=80
```
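For reference, this is roughly how the correction step is invoked (a sketch: smarcescens.cfg is just an assumed name for the config above, and the subcommand follows the usage shown in the PECAT README):

```
# run only PECAT's error-correction stage on the config above
# (smarcescens.cfg is an assumed file name for the configuration shown)
pecat.pl correct smarcescens.cfg

# corrected reads are written under the project directory
# (the exact path may vary by version)
```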

And I ended up going from 16k reads to 12k reads (with the N50 going from 12.5 kb to 12.9 kb).

When I assembled the PECAT-corrected reads (without polishing) using a different assembler, NECAT (just to avoid assembler bias), I ended up with the following statistics:

- Contigs: 25
- Assembly size: 5.6 Mb
- Minimum length: 18 kb
- Maximum length: 1.3 Mb
- N50: 550 kb

These statistics seem a bit far from what I get with Canu-corrected reads:

- Contigs: 4
- Assembly size: 5.7 Mb
- Minimum length: 17 kb
- Maximum length: 5.5 Mb
- N50: 5.5 Mb

So I noticed that when I assemble the PECAT-corrected reads, the assembly is highly fragmented compared to the Canu-corrected reads. I'm aware that the statistics above don't entirely reflect the quality of the assembly; still, I feel the PECAT-corrected reads weren't as "contiguous" as the Canu-corrected reads, which is why I'm wondering whether I'm setting up the configuration file correctly.

Here's my Canu command:

```
user/tools/canu-2.2/bin/canu -correct \
    -p smarcescens_canu_corrected \
    -d canu_correction_output \
    genomeSize=5.5m \
    correctedErrorRate=0.15 \
    useGrid=false \
    minReadLength=1000 \
    corThreads=4 \
    -nanopore-raw smarcescens_simplex.filtered.fastq 2>&1
```

And here's my NECAT configuration used to assemble both the Canu- and PECAT-corrected reads:

```
PROJECT=necat_assembly
ONT_READ_LIST=
GENOME_SIZE=5500000
THREADS=4
MIN_READ_LENGTH=3000
PREP_OUTPUT_COVERAGE=40
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30
CLEANUP=1
```
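(For reference, a sketch of how the assembly is then launched, assuming the config above is saved as necat_assembly.cfg and following the subcommands from the NECAT README:)

```
# run the NECAT assembly pipeline with the config above
# (necat_assembly.cfg is just an assumed file name)
necat.pl assemble necat_assembly.cfg

# optional contig bridging afterwards
necat.pl bridge necat_assembly.cfg
```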

I'd really appreciate it if you can give us ideas on how to set the parameters in PECAT.

Thanks!

lemene commented 4 months ago

Hi @jp-jong
The configuration is OK. demo/configs contains config templates for genomes of different sizes. What are the total size and N50 of the raw reads and of the corrected reads? PECAT may filter out or truncate low-quality reads, which can cause fragmentation. Do you have any statistics for the assembly produced by PECAT itself? These would help me adjust the parameters.
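For example, something like this should give both numbers (a sketch, assuming seqkit is installed; the corrected-reads file name is a placeholder):

```
# seqkit stats -a reports num_seqs, sum_len and N50, among other metrics
seqkit stats -a smarcescens_simplex.filtered.fastq
seqkit stats -a pecat_corrected_reads.fasta   # placeholder name for the corrected reads
```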

In the NECAT parameters, I think using CNS_READ_LIST instead of ONT_READ_LIST would be better, as it skips NECAT's own error-correction step.
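Something along these lines, assuming CNS_READ_LIST takes a list file in the same format as ONT_READ_LIST (the list file name is a placeholder, and it should contain the path(s) to the already-corrected reads):

```
# leave ONT_READ_LIST empty and point CNS_READ_LIST at the corrected reads
ONT_READ_LIST=
CNS_READ_LIST=pecat_corrected_reads.txt
```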

jp-jong commented 4 months ago

Hi @lemene. I haven't used PECAT's assembly steps yet, but I will try them soon, once I understand PECAT's correction step.

Here are the statistics (from seqkit stats) for my raw reads and corrected reads:

- Raw reads: total size 132,927,270 bp; N50 12,542
- PECAT-corrected reads: total size 121,927,673 bp; N50 12,919
- Canu-corrected reads: total size 123,492,474 bp; N50 13,275
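That works out to roughly 24x raw coverage and about 22x after PECAT correction of the 5.5 Mb genome (a quick sketch of the arithmetic):

```
# rough coverage estimate: total bases / genome size
awk 'BEGIN { printf "raw: %.1fx\ncorrected (PECAT): %.1fx\n", 132927270/5500000, 121927673/5500000 }'
# raw: 24.2x
# corrected (PECAT): 22.2x
```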

I also changed ONT_READ_LIST to CNS_READ_LIST when assembling with NECAT. Here are my statistics:

Assembled PECAT-corrected reads:

- Number of contigs: 5
- Total size: 5,731,402 bp
- N50: 5,511,194
- Average length: 1,146,280.40

Assembled Canu-corrected reads:

- Number of contigs: 4
- Total size: 5,681,861 bp
- N50: 5,471,264
- Average length: 1,420,465.30

Disabling the correction step in NECAT did make a difference in the assembly statistics (especially for the PECAT-corrected reads). I also see that PECAT seems to correct more reads, and it results in a slightly lower total read size after correction. In terms of the assembly, it has a longer N50 but a somewhat lower average contig length compared to the assembly of the Canu-corrected reads.

@lemene What do you think? I'm quite satisfied with this comparison since there isn't much of a difference, but I'd appreciate any input from you on how to improve our correction configuration further.