ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

sequencing error rate scale #75

Closed mattbird567 closed 1 year ago

mattbird567 commented 1 year ago

Less of an issue and more of a question, but I'm struggling to understand the scale of the sequencing error variable. As an example, you use PacBio reads with -E 0.10, but PacBio reads have an average sequencing error rate of 15%. Does that mean 0.10 equates to a 10% error rate? I'm trying to simulate Illumina reads with the average sequencing error rate (which on the high end is 0.1 per base sequenced), so I set -E to 0.1, but am now thinking that may be wrong. Could you explain the scale of -E to me?

joshfactorial commented 1 year ago

Yeah, generally it's designed to maintain the same error patterns in the reads, but scale them by -E. So, it should be that if you give it -E 0.10, then your average error rate in the output should be 10%. The way it works is the error model (whether you use the default or a custom one) has an error rate, which may be 15%. NEAT tries to retain the 'shape' of the errors (where they appear on the read, generally towards the ends on Illumina machines), while just making errors generally less frequent. In practice, though, I'm not sure it works the way it should. But maybe for larger datasets than what I was testing on, it would work better.
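The behavior described above — keeping the positional "shape" of the error profile while rescaling its average to match -E — can be sketched roughly like this. This is an illustrative sketch only, not NEAT's actual implementation; the function name and the example rates are made up for the illustration:

```python
def rescale_error_model(position_rates, target_avg):
    """Scale per-position error probabilities so their mean equals
    target_avg, preserving the relative shape of the error profile.

    Illustrative only: NEAT's real error model is more involved
    (quality-score based, learned from data)."""
    model_avg = sum(position_rates) / len(position_rates)
    factor = target_avg / model_avg
    # Probabilities must stay valid after scaling, so cap at 1.0.
    return [min(r * factor, 1.0) for r in position_rates]

# Hypothetical model with a 15% average error rate, errors rising
# toward the end of the read (as on Illumina machines):
model = [0.10, 0.12, 0.15, 0.18, 0.20]
scaled = rescale_error_model(model, 0.10)  # analogous to -E 0.10
# The scaled profile keeps the same shape but now averages 10%.
```

Under this reading, -E 0.10 means "10% average error rate in the output", with the positional distribution of errors inherited from the underlying model.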

mattbird567 commented 1 year ago

Ok, thanks for the clarification!