rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Assembly qscore plots #51

Closed by sarah872 1 year ago

sarah872 commented 1 year ago

Hi, I came across the assembly qscore vs read depth plot, and the accuracy worst 100 bp plot. I would like to make similar plots for my data. Would you be so kind as to share the scripts?

Also, speaking of assembly qscores: the MISAG/MIMAG genome standards require an assembly to have a qscore of ≥50. This seems very high, as the Flye assembly from the plot at 400x coverage only gets up to a qscore of ~46. What's your take on this requirement? Also, it's not clear what they mean by it, i.e. is it the mean qscore over the whole assembly?

rrwick commented 1 year ago

Here are the relevant files:

No promises that they are intelligible, though!

Regarding the assembly qscores, I do think Q50 is a pretty high standard, but it might come down to how you quantify it. For example, in my analysis, I count a 100-bp indel as 100 errors, so just one 100-bp indel in a 5 Mbp assembly will knock the qscore down to 47. Considering how common indel errors are in repeat regions of the genome, I suspect very few isolate assemblies (let alone MAGs) reach Q50 when quantified this way. But if you quantify assembly accuracy in a more generous way (e.g. counting a 100-bp indel as one error, excluding repeat regions, etc.), then Q50 is a lot more attainable.
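To make the arithmetic above concrete, here is a minimal sketch that assumes the qscore is the usual Phred-scaled error rate, Q = -10 * log10(errors / assembly length). It is not the script used for the plots; the function name `assembly_qscore` and the example numbers are only for illustration.

```python
import math


def assembly_qscore(error_count: int, assembly_length: int) -> float:
    """Phred-scaled accuracy, assuming Q = -10 * log10(errors / length).

    Hypothetical helper for illustration; not taken from the Trycycler scripts.
    """
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    if error_count < 0:
        raise ValueError("error_count cannot be negative")
    if error_count == 0:
        # No observed errors: the score is unbounded.
        return math.inf
    return -10.0 * math.log10(error_count / assembly_length)


genome_length = 5_000_000  # the 5 Mbp assembly from the example above

# Strict counting: one 100-bp indel counts as 100 errors.
strict = assembly_qscore(100, genome_length)

# Generous counting: the same indel counts as a single error.
generous = assembly_qscore(1, genome_length)

print(f"strict:   Q{strict:.1f}")    # about Q47
print(f"generous: Q{generous:.1f}")  # about Q67
```

Under this definition, Q50 corresponds to one error per 100,000 bp, so a 5 Mbp assembly could contain at most 50 erroneous bases, which is why a single long indel counted base by base is enough to miss the threshold.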

Ryan

sarah872 commented 1 year ago

Thank you for this! What I actually meant was: how do you get the value of a single data point, i.e. the value for

Concerning the quality standards for MAGs, now that I've thought more about it: there's always the need for a reference, right? Or is there a way to determine qscores from reads, i.e. via mapping?