Outliers & best practices

statdivlab / corncob

Count Regression for Correlated Observations with the Beta-binomial


formula = ~ Chemical_characteristic+Region+Year, phi.formula = ~ Region+Year, formula_null = ~ Region+Year, phi.formula_null = ~ Region+Year

Hi @ianartmor!

Copying from my email for others' reference.

1. In terms of outliers, I wouldn't worry too much about them. corncob is designed specifically to handle highly dispersed count data. It doesn't treat outliers in any special way, nor do I think it should, frankly. Depending on your sample size, I don't think any single point will have enough leverage to substantially change the results, so I'm glad your exploration confirmed that. In general, I recommend against removing outliers unless you have reason to believe they are invalid. Most people who remove them are just p-hacking :). Ultimately the choice is yours, though! My main point is that corncob can handle them fine.
2. 5 per group is fine! 0 is not. However, you don't have interaction effects, so you only need to worry about the marginal groupings, not the combinations. One thing to be aware of is that your estimates might have high standard errors and low power, so you might want to look into that. Mathematically, as long as you have some observations in each group (and in your case that means each marginal group, not each combination of groups), it should be just fine.
3. corncob will automatically filter out models that are mathematically impossible to fit, as described in point two, so you don't need to worry about filtering them yourself (see the filter_discriminant parameter in differentialTest). In terms of sanity checks, I highly recommend plotting a few individual model fits. Check some highly significant ones and some random ones: fit a single model using bbdml, or extract the models from differentialTest with full_output = TRUE, and type plot(YOUR_OUTPUT). Visual investigation is the best sanity check, in my opinion.
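
For others reading along, here is a rough sketch of the two checks mentioned above: counting observations in each marginal group, and plotting a single bbdml fit. It is only an illustration under assumptions. The data frame `samples`, its values, the region and year labels, and the columns `W` (counts for one taxon) and `M` (total counts per sample) are all simulated and hypothetical, and the formula simply reuses the covariate names from the question. The corncob step is skipped if the package is not installed.

```r
# Hypothetical stand-in for real sample data: 3 regions x 2 years x 5 replicates.
set.seed(1)
samples <- expand.grid(
  Region = c("North", "South", "West"),
  Year = c("2019", "2020"),
  replicate = 1:5
)
n <- nrow(samples)
samples$Chemical_characteristic <- rnorm(n)

# Simulated overdispersed counts for one taxon:
# M = total reads per sample, W = reads assigned to the taxon.
samples$M <- rpois(n, lambda = 5000) + 1L
samples$W <- rbinom(n, size = samples$M, prob = rbeta(n, 2, 18))

# With no interaction terms, only the marginal group sizes need to be non-zero.
region_counts <- table(samples$Region)
year_counts <- table(samples$Year)
print(region_counts)
print(year_counts)
stopifnot(all(region_counts > 0), all(year_counts > 0))

# The Region x Year cross-tabulation is informational only in this setting.
print(table(samples$Region, samples$Year))

# Fit one beta-binomial model and plot it as a visual sanity check.
if (requireNamespace("corncob", quietly = TRUE)) {
  fit <- corncob::bbdml(
    formula = cbind(W, M - W) ~ Chemical_characteristic + Region + Year,
    phi.formula = ~ Region + Year,
    data = samples
  )
  print(plot(fit))
}
```

With real data, you would replace `samples` with your own sample table (or a phyloseq object) and repeat the plot for a few highly significant taxa and a few random ones.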

Cheers, Bryan

Outliers & best practices #95