rwdavies / QUILT

QUILT: Low coverage whole genome sequence imputation with large reference panels
https://www.nature.com/articles/s41588-021-00877-0
GNU General Public License v3.0

Sequence quality effects #39

Open LizzieMcDizzie opened 3 months ago

LizzieMcDizzie commented 3 months ago

Hi. Thanks for QUILT - it is really great!

I was wondering if you have any information on the effect of data quality (Phred scores) on the results. Basically, I want to work out whether trimming the data is best, or whether QUILT accounts for low-quality bases in the algorithm. E.g. if I had a sample that was sequenced to 0.12X, and I would normally trim away ~20% based on quality (ending up with roughly 0.1X), would I get better results by including only the 0.1X of higher-quality data, or by including all 0.12X?

Is this a 'less is more' situation, or 'more is more'?

Thanks!

rwdavies commented 3 months ago

Thanks!

I've not evaluated this myself, and I think it could probably only be evaluated properly empirically, because of the number of factors at play. I've generally found the INFO score to be a very good predictor of imputation accuracy, so you could try running a few samples twice, once without filtering and once with, and compare the mean INFO scores across various classes of SNPs (e.g. common or rare).
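A minimal sketch of that comparison in Python, for anyone who wants a starting point. It assumes each run produced a VCF whose INFO column carries an `INFO_SCORE` entry and an `EAF` (estimated allele frequency) entry; those key names are assumptions to check against your own output header, and the helper names and frequency bin edges are hypothetical choices for illustration.

```python
from collections import defaultdict


def parse_info(info_field):
    """Turn a VCF INFO string such as 'EAF=0.1;INFO_SCORE=0.9' into a dict."""
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
    return out


def maf_bin(maf, edges):
    """Return the (lo, hi) bin containing maf; the last bin includes its upper edge."""
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo <= maf < hi or (hi == edges[-1] and maf == hi):
            return (lo, hi)
    return None


def mean_info_by_maf_bin(vcf_lines, edges=(0.0, 0.01, 0.05, 0.5),
                         info_key="INFO_SCORE", af_key="EAF"):
    """Mean INFO score per minor-allele-frequency bin.

    info_key and af_key are ASSUMED names for the INFO-column entries;
    adjust them to match the header of your own VCF.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])  # column 8 of a VCF record is INFO
        try:
            af = float(info[af_key])
            score = float(info[info_key])
        except (KeyError, ValueError):
            continue  # site lacks one of the two values
        key = maf_bin(min(af, 1.0 - af), edges)
        if key is not None:
            sums[key] += score
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}
```

To use it, open each run's VCF as text (for a gzipped file, `gzip.open(path, "rt")` from the standard library), pass the file handle to `mean_info_by_maf_bin`, and compare the two resulting dictionaries bin by bin.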

More generally, I think it depends on whether the Phred scores are calibrated for these parts of the reads, and whether the errors are random. If errors are random and Phred scores are calibrated, I would definitely expect more data to be better. As these conditions stop being met, particularly the randomness of the errors, I think the extra data would be less useful, and things could potentially get worse.
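To illustrate why calibrated scores should make extra data harmless, here is a rough sketch using the standard Phred definition (error probability = 10^(-Q/10)) and a simple textbook error model in which a miscalled base is equally likely to be any of the three other bases. This is an assumption for illustration only and is not taken from QUILT's code; the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Standard Phred definition: probability that the base call is wrong."""
    return 10 ** (-q / 10)


def base_likelihood_ratio(q):
    """Support a base gives the allele it shows versus one specific other allele.

    ASSUMED model: the call is correct with probability 1 - e, and an error
    lands on each of the three other bases with probability e / 3.
    """
    e = phred_to_error_prob(q)
    return (1 - e) / (e / 3)


# Higher quality means stronger evidence per base; low-quality bases still
# point the right way under this model, they just count for much less.
for q in (2, 10, 20, 30):
    print(q, round(phred_to_error_prob(q), 4), round(base_likelihood_ratio(q), 1))
```

Under this model a Q2 base has a likelihood ratio below 2, so it barely moves the result, while a Q30 base has a ratio near 3000. That only holds if the stated quality matches the true error rate and the errors are not systematic, which is the caveat above.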

Hope that helps!