marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

How to get corrected fastq files? #2337

Closed Rohit-Satyam closed 2 months ago

Rohit-Satyam commented 3 months ago

Hi Canu developers

We have some Sequence-Independent, Single-Primer-Amplification (SISPA) samples that were sequenced on ONT, and we wish to perform error correction, since here we are marrying two error-prone technologies: SISPA and ONT. We want to show that Canu's error-correction step takes care of many of these errors, but to do that we need corrected fastq files, and we wish to perform this on all reads >100 bp in length. However, I realise that the Canu correction step generates fasta files, not fastq files. Is there a way to do that?

I am currently using the following command (maxInputCoverage=20000 is set to prevent subsampling, since the maximum coverage detected was ~900X; note that a trailing comment after a backslash would break the line continuation, so it is kept out of the command):

canu -correct \
  -p sample1 \
  -d assembly \
  minReadLength=100 minOverlapLength=100 genomeSize=10000 \
  maxInputCoverage=20000 \
  -nanopore-raw sample1.fq.gz 2> modified.log
Rohit-Satyam commented 3 months ago

I am trying to find a good tradeoff between sensitivity and coverage for the corrected reads. Since Canu outputs only a fasta file, I aligned the corrected reads against my viral reference, and the results for the SISPA protocol are as follows:

Note: We are using the latest chemistry with Flow Cell R10.4, and Canu version 2.2 installed via conda.

Original reads: obtained after dehosting with hostile and chimeric-read removal with YACRD
Default: correction using Canu's default settings plus minReadLength=100 minOverlapLength=100 genomeSize=10000
maxInputCoverage=20000: Default plus the maxInputCoverage=20000 parameter
corMinCoverage=0 corOutCoverage=20000 corMhapSensitivity=high: using the parameters described in the issue here

skoren commented 3 months ago

I'm not sure what you intended to use the quality values for, but you can generally consider the full corrected read as "high-quality." There are minimum-coverage requirements and trimming built in. However, the correction does tend to mix up close haplotypes.

The quality values from the initial reads are not stored, the consensus of each read is completely regenerated, and reads can grow or shrink in size, so there is no way to recover the original qualities. There are some coverage-based quality estimates, but those hit Q60 very quickly, so I suspect all your reads would be Q60: they capture only how many reads aligned to correct a given position, not whether that alignment was correct.
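Since the original qualities cannot be recovered, one workaround for downstream tools that insist on fastq input is to assign a uniform placeholder quality to every base of the corrected reads. A minimal sketch (assumptions: Canu's corrected output is a possibly gzipped fasta such as sample1.correctedReads.fasta.gz, and a flat Q20 placeholder is acceptable for your downstream tooling; the function name is illustrative, not part of Canu):

```python
import gzip

def fasta_to_fastq(fasta_path, fastq_path, qual_char="5"):
    """Convert a (possibly gzipped) FASTA to FASTQ with a constant
    placeholder quality. '5' is Q20 in Phred+33 encoding."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    with opener(fasta_path, "rt") as fin, open(fastq_path, "w") as fout:
        name, seq = None, []

        def flush():
            # Emit the buffered record, padding qualities to the read length.
            if name is not None:
                s = "".join(seq)
                fout.write(f"@{name}\n{s}\n+\n{qual_char * len(s)}\n")

        for line in fin:
            line = line.strip()
            if line.startswith(">"):
                flush()
                # Keep only the read ID, dropping any description fields.
                name, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
        flush()
```

For example, fasta_to_fastq("sample1.correctedReads.fasta.gz", "sample1.corrected.fastq") would produce fastq records whose quality strings are all-Q20; just keep in mind, per the explanation above, that these values are placeholders and carry no per-base accuracy information.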