johnlees / seer

sequence element (kmer) enrichment analysis
GNU General Public License v2.0

Incorrectly 'commented' as bad-chi square? #54

Closed flashton2003 closed 7 years ago

flashton2003 commented 7 years ago

Looking at the output of a seer run, lots of the samples have 'bad-chisq' in the 'comment' column. However, the chisq p-value for some of these is very low, down to 10E-27.

From the paper, it seems that the cutoff you use is chisq p-value > 10E-5, so I'm just wondering what is happening in the 'comment' column.

johnlees commented 7 years ago

This comment is added when the assumptions of the chi-squared test are violated, in which case the test will produce p-values much lower than it should. This happens with a combination of low frequency and high effect size, which leaves some cells of the contingency table with expected values close to zero. The cutoff is still applied to these k-mers, but the result is likely to be invalid (though at least liberal, so these k-mers are passed on rather than filtered out).
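To make the point about small expected values concrete, here is a minimal sketch in Python. It is not seer's code: the function names, the example counts, and the threshold of 5 (the common rule of thumb for chi-squared tests) are all assumptions for illustration, and seer's actual criterion for adding the comment may differ.

```python
def expected_counts(table):
    """Expected cell counts of a contingency table under independence.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chisq_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def has_small_expected(table, threshold=5.0):
    """True if any expected count is below the threshold (rule of thumb, assumed)."""
    return any(e < threshold for row in expected_counts(table) for e in row)


# Hypothetical data: rows are k-mer present / absent, columns are cases / controls.
# Rare k-mer with a large effect: seen in 8 of 300 cases and 0 of 700 controls.
rare_kmer = [[8, 0], [292, 700]]
# Common k-mer: seen in 200 of 300 cases and 300 of 700 controls.
common_kmer = [[200, 300], [100, 400]]

for name, table in [("rare", rare_kmer), ("common", common_kmer)]:
    print(name, expected_counts(table), has_small_expected(table))
```

For the rare k-mer the expected count of "present in cases" is only 2.4, so the chi-squared approximation is unreliable there even though the statistic itself is large.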

The chi-squared test should be ignored in these cases, and caution used when interpreting the corrected values (though Firth regression should be used, which may go some way towards alleviating this issue). If you are getting a lot of these, you may need to filter at a higher minor allele frequency (MAF).
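As a rough illustration of what filtering at a higher MAF means, the sketch below drops k-mers whose minor frequency falls under a cutoff. The function names, the k-mer counts, and the 0.05 cutoff are hypothetical; this does not show how seer itself applies its filter.

```python
def maf(n_present, n_samples):
    """Minor frequency of a presence/absence pattern: min(f, 1 - f)."""
    freq = n_present / n_samples
    return min(freq, 1 - freq)


def filter_by_maf(kmer_counts, n_samples, min_maf):
    """Keep only k-mers whose minor frequency is at least min_maf."""
    return {
        kmer: count
        for kmer, count in kmer_counts.items()
        if maf(count, n_samples) >= min_maf
    }


# Hypothetical counts: number of samples (out of 1000) containing each k-mer.
kmer_counts = {"ACGT": 8, "TTGA": 300, "GGCC": 995}
kept = filter_by_maf(kmer_counts, n_samples=1000, min_maf=0.05)
print(kept)
```

Both very rare and nearly ubiquitous k-mers are removed, since either extreme produces small expected counts in the contingency table.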

This wiki page may be helpful: https://github.com/johnlees/seer/wiki/Comment-field