rwdavies / QUILT

GNU General Public License v3.0
INFO SCORE across multiple samples #24

Open SABiagini opened 1 year ago

SABiagini commented 1 year ago

Hi,

I have imputed a batch of 5000 individuals. As far as I understand, each of them has been independently imputed by QUILT.

However, I have noticed that there is only a single INFO SCORE per variant, not one per sample. Therefore, I would like to know more about how the INFO SCORE is calculated when multiple samples are imputed together.

Also, in case it is a consensus INFO SCORE, I would like to know what its reliability would be for filtering variants across multiple individuals.

Thank you.

rwdavies commented 1 year ago

Hey,

So the INFO score being used is defined here (https://www.well.ox.ac.uk/~gav/snptest/#info_measures). Informally, the INFO score captures the non-uniformity of the genotype posteriors. If they are flat (non-informative), it is low, while if they are mostly concentrated in one genotype, it goes closer to 1. It is indeed always a consensus score across samples; it doesn't really make sense for a single sample. Since it is entirely derived from the genotype posteriors, for one sample you could just use those directly somehow, e.g. the genotype posterior of the argmax genotype.
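To make the arithmetic concrete, here is a minimal sketch of the IMPUTE-style info measure described on the linked SNPTEST page, written in plain Python. It is an assumption that this matches the exact variant QUILT reports; the function names and the (hom-ref, het, hom-alt) tuple layout are hypothetical and not taken from the QUILT code.

```python
def info_score(posteriors):
    """Per-variant INFO score from per-sample genotype posteriors.

    posteriors: list of (p_homref, p_het, p_homalt) tuples, one per sample.
    Sketch of the IMPUTE-style measure; assumed, not QUILT's own code.
    """
    n = len(posteriors)
    if n == 0:
        raise ValueError("need at least one sample")
    # Expected alt-allele dosage e_i and expected squared dosage f_i per sample
    e = [p_het + 2.0 * p_alt for _, p_het, p_alt in posteriors]
    f = [p_het + 4.0 * p_alt for _, p_het, p_alt in posteriors]
    # Estimated alt-allele frequency across all samples
    theta = sum(e) / (2.0 * n)
    if theta <= 0.0 or theta >= 1.0:
        # Monomorphic in expectation: the measure is defined as 1 here
        return 1.0
    # Sum of per-sample posterior variances of the dosage
    posterior_var = sum(fi - ei * ei for ei, fi in zip(e, f))
    # Compare with the variance expected from the allele frequency alone
    return 1.0 - posterior_var / (2.0 * n * theta * (1.0 - theta))


def max_posterior(posterior):
    """Per-sample confidence: posterior of the most likely (argmax) genotype."""
    return max(posterior)
```

Under this formula, a variant where every sample has a certain genotype scores 1, and a variant where every sample's posterior equals the Hardy-Weinberg proportions for the estimated allele frequency (e.g. `(0.25, 0.5, 0.25)` at frequency 0.5) scores 0. Because the sum runs over all samples, the score is per variant, which is why a per-sample quantity such as `max_posterior` is the natural substitute for one individual.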

It's normally highly reliable for filtering across multiple individuals. See https://github.com/rwdavies/STITCH/issues/75 for some general comments re: INFO score. Note, though, that with QUILT there could be calibration issues, so I would say it's more reliable for STITCH than for QUILT.

Let me know if you have more questions, Robbie

SABiagini commented 1 year ago

Hi Robbie,

Thanks for clarifying about the INFO SCORE being a consensus when imputing multiple samples together.

Your insight on filtering using INFO SCORE being more reliable for STITCH than QUILT helped me to better understand my observations.

I have 3 imputed samples that I use as proxies for testing filtering strategies, and I also have high-coverage data for each of them.

In my analyses, I noticed that filtering on INFO SCORE alone, removing variants with INFO SCORE < 0.4, is not effective in improving data quality. Specifically, for all the statistics I tested (sensitivity, accuracy, precision, specificity, among others), the filtered data remained similar to the unfiltered data. Raising the threshold to 0.8 did improve the statistics, but it would be too aggressive.
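A threshold comparison of this kind can be sketched as below. This is only an illustration under assumptions: the function name, the `(info, imputed_genotype, truth_genotype)` record layout, and the toy records are hypothetical and are not the data or the exact statistics from these tests. It reports the fraction of variants retained next to genotype concordance with the high-coverage truth, so the trade-off of a stricter threshold is visible.

```python
def evaluate_threshold(variants, min_info):
    """Keep variants with INFO >= min_info and compare them with the truth.

    variants: list of (info, imputed_genotype, truth_genotype) tuples,
              genotypes coded as 0/1/2 alt-allele counts (assumed layout).
    Returns (fraction_retained, concordance); concordance is None if
    nothing passes the filter.
    """
    if not variants:
        raise ValueError("need at least one variant")
    kept = [(g, t) for info, g, t in variants if info >= min_info]
    retained = len(kept) / len(variants)
    if not kept:
        return retained, None
    concordance = sum(g == t for g, t in kept) / len(kept)
    return retained, concordance


# Made-up toy records, for illustration only
toy = [
    (0.95, 0, 0),
    (0.85, 1, 1),
    (0.60, 1, 2),
    (0.30, 0, 1),
    (0.90, 2, 2),
]

for threshold in (0.0, 0.4, 0.8):
    print(threshold, evaluate_threshold(toy, threshold))
```

Tracking the retained fraction alongside the accuracy statistic is the design choice that matters here: a threshold that improves concordance may do so only by discarding a large share of variants.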

Nevertheless, I discovered that combining INFO SCORE with other filters can be useful. At least, this is what I observed in my tests on these 3 separate samples. Dealing with multiple samples, of course, presents a challenge, but that's a different story!

Thanks again.

Best,

S.