Substantial loss of sequence when using corrected reads to polish

kishwarshafin / pepper

PEPPER-Margin-DeepVariant

MIT License

245 stars 42 forks source link

Substantial loss of sequence when using corrected reads to polish #38

Closed ASLeonard closed 3 years ago

ASLeonard commented 3 years ago

I've been using pepper to polish 2.7gb mammalian genomes assembled with recent ONT reads with pretty good success. I tried repeating the assembly + polishing steps with Ratatosk corrected ONT reads. The assembly looked good, but the polishing step went pretty wrong.

My guess is whatever the post-correction error profile is too different from the pepper model (using haploid_guppy360.pkl) and hence did more harm than good. In this case, the size of the genome went from 2.65gb to 1.25gb (and estimated QV from 30 to 11). This is probably a "bad" use case for pepper, but I was surprised at the magnitude of the impact. I wasn't sure if you might have any other insight, or if this is just model incompatibility.

kishwarshafin commented 3 years ago

Hi @ASLeonard ,

I'm not familiar with Ratatosk. However, from your description of the issue, it seems like your reads don't have base qualities? Can you please check if your bam file includes base qualities?

ASLeonard commented 3 years ago

The output does include base qualities, but they are assigned based on the confidence of the correction via the de bruijn graph. A random string of base qualities was IIII!!!!!!!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII>>>>>>CCCCCCDDDDDDIIIIII compared to a random string from the raw reads 2:>@;A;<ECB=:::?8::8.**()'*&&,734()&00/011+,739:2%3(3.*/'&&&'/2012078>=:<<>9:9;7.

So the corrected reads have base qualities in the bam file but with a totally different distribution of values with long tracts of single values.

kishwarshafin commented 3 years ago

@ASLeonard it'd be extrapolating at this point but I think the entire read quality distribution shifts left after correction which affects the polishing. We have not trained a model for corrected reads so I can't exactly help in this case. So, you are kind of right in the sense that this is not an ideal use of PEPPER.