Closed ASLeonard closed 3 years ago
Hi @ASLeonard ,
I'm not familiar with Ratatosk
. However, from your description of the issue, it seems like your reads don't have base qualities? Can you please check if your bam file includes base qualities?
The output does include base qualities, but they are assigned based on the confidence of the correction via the de bruijn graph. A random string of base qualities was IIII!!!!!!!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII>>>>>>CCCCCCDDDDDDIIIIII
compared to a random string from the raw reads 2:>@;A;<ECB=:::?8::8.**()'*&&,734()&00/011+,739:2%3(3.*/'&&&'/2012078>=:<<>9:9;7
.
So the corrected reads have base qualities in the bam file but with a totally different distribution of values with long tracts of single values.
@ASLeonard it'd be extrapolating at this point but I think the entire read quality distribution shifts left after correction which affects the polishing. We have not trained a model for corrected reads so I can't exactly help in this case. So, you are kind of right in the sense that this is not an ideal use of PEPPER.
I've been using pepper to polish 2.7gb mammalian genomes assembled with recent ONT reads with pretty good success. I tried repeating the assembly + polishing steps with Ratatosk corrected ONT reads. The assembly looked good, but the polishing step went pretty wrong.
My guess is whatever the post-correction error profile is too different from the pepper model (using
haploid_guppy360.pkl
) and hence did more harm than good. In this case, the size of the genome went from 2.65gb to 1.25gb (and estimated QV from 30 to 11). This is probably a "bad" use case for pepper, but I was surprised at the magnitude of the impact. I wasn't sure if you might have any other insight, or if this is just model incompatibility.