tompollard / tableone

Create "Table 1" for research papers in Python
https://pypi.python.org/pypi/tableone/
MIT License
161 stars · 38 forks

Editor's comments #55

Closed: tompollard closed this issue 6 years ago

tompollard commented 6 years ago

The following comments need to be addressed:

Associate Editor Comments to the Author: I would like to mark this as "acceptable" because certainly it can be useful to people who are sufficiently knowledgeable about statistics in the first place, and it can play a circumscribed role for a certain range of studies. But I don't see that it gives adequate guidance to others to avoid bad decisions, or enough flexibility to make good decisions. For example, when should Bonferroni be used to correct for multiple testing? How to decide? And if another method is appropriate, which is not included in the package, what to do? (btw Bonferroni is rarely the best method, and Sidak is not either). I think there needs to be further changes, both to the paper and especially to the documentation, to avoid misuse.

jraffa commented 6 years ago

Bonferroni's correction is usually considered "conservative" in the sense that it corrects too much (adjusted p-values are larger than they need to be). The Associate Editor is correct that another method may be better suited depending on the situation. What I would suggest is updating the documentation to reflect this, along the lines of:

Bonferroni's correction addresses the problem of multiple comparisons in a simple way by dividing the prespecified significance level (Type I error rate, $\alpha$) by the number of hypothesis tests conducted. This correction is known to overcorrect, effectively reducing the statistical power of the tests, particularly when the number of hypotheses is large or when the tests are positively correlated. There are many widely used alternatives that may be more suitable, including:

Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B. 57 (1): 289–300.

Holm, S. (1979). "A simple sequentially rejective multiple test procedure". Scandinavian Journal of Statistics. 6 (2): 65–70.

Šidák, Z. K. (1967). "Rectangular Confidence Regions for the Means of Multivariate Normal Distributions". Journal of the American Statistical Association. 62 (318): 626–633.

Please consider investigating one of these alternatives if you think you may be in a situation where you would be adversely affected by the conservative nature of the Bonferroni correction.
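For illustration, here is a minimal pure-Python sketch (not tableone code; the `bonferroni` and `holm` helpers are written just for this comment) of how Holm's step-down procedure compares with Bonferroni on the same p-values. Packages such as statsmodels implement these and more:

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm's step-down procedure: multiply the k-th smallest p-value by
    (m - k), capping at 1 and enforcing monotonicity; it is uniformly at
    least as powerful as Bonferroni while still controlling the FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvals[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.02, 0.03, 0.04]
print([round(p, 3) for p in bonferroni(raw)])  # [0.04, 0.08, 0.12, 0.16]
print([round(p, 3) for p in holm(raw)])        # [0.04, 0.06, 0.06, 0.06]
```

Note how Holm's adjusted p-values are never larger than Bonferroni's, which is exactly the sense in which Bonferroni overcorrects.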

tompollard commented 6 years ago

@jraffa we have around a week to submit the revised version. If you have time, please could you help to address the comments above by thinking about the following points?

alistairewj commented 6 years ago

I think we could have a documentation page called "best practices" or something similar with these suggestions. Then, on the quickstart documentation page (which will be the most frequently used page by far), we could run through the sample dataset and link out to the best practices page where relevant. So something like:

"You can summarize data using either medians or means (When should I use the median?)", where "(When should I use the median?)" links to an appropriate section of the best practices page. My initial thoughts for topics we should cover:

Thoughts?

tompollard commented 6 years ago

This sounds like a good plan. So in summary:

jraffa commented 6 years ago

Blurb about using nonnormal:

Considerations for choosing Tableone parameters

For numeric variables, including integer and floating point values as well as some ordered discrete variables, the nonnormal argument of TableOne merits some discussion. The practical consequence of including a variable in the nonnormal argument is to rely on rank-based methods \cite{lehmann1975nonparametrics,conover1981rank} to estimate the center and variability of the variable's distribution, along with non-parametric methods to test the hypothesis that the distributions of all the groups are the same.

When a variable is normally distributed, both estimation and hypothesis testing (provided the standard deviations of each group are the same) are more efficient when the variable is not included in the nonnormal argument \cite{hodges1956efficiency,zimmerman1987comparative}. This may also hold in some circumstances where the data are clearly not normally distributed, provided the sample sizes are large enough. In other situations, assuming normality when the data are not normally distributed can lead to inefficient or spurious inference.

The mean and standard deviation are often poor estimates of the center or dispersion of a variable's distribution when the distribution is asymmetric, has 'fat' tails and/or outliers, contains only a very small finite set of values, or is multi-modal. Although formal statistical tests are available to detect most of these features, they are often not very useful at small sample sizes \cite{razali2011power}. Plotting the distribution of each variable by group level (via histograms, kernel density estimates or boxplots) is a crucial component of data analysis pipelines, and is often the only way to detect problematic variables in many real-life scenarios. One alternative to data visualization is to make two calls to TableOne: one with all numeric variables included in the nonnormal argument, and one with none of them. One can then focus on situations where:

- substantial differences exist between the mean and median estimates;
- the median or mean is not well centered between the first and third quartiles \cite{altman1996detecting};
- large differences exist between the IQR (the difference between the first and third quartiles) and the standard deviation, noting that the IQR will be about 35% larger than the standard deviation under normality.
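As a rough illustration of these checks (not part of tableone; the `skew_flags` helper and its threshold are hypothetical), summary statistics can be screened for signs of non-normality using only the standard library:

```python
import random
import statistics

def skew_flags(data, tol=0.2):
    """Screen summary statistics for signs of non-normality (cf. Altman & Bland).

    Flags the variable when (a) the mean and median differ markedly relative to
    the spread, or (b) the median is not roughly centred between Q1 and Q3.
    Under normality the IQR is about 1.35 times the standard deviation.
    The 0.2 threshold is an arbitrary choice for illustration only.
    """
    q1, med, q3 = statistics.quantiles(data, n=4)
    mean, sd = statistics.mean(data), statistics.stdev(data)
    iqr = q3 - q1
    return {
        "mean_vs_median": abs(mean - med) / sd > tol,
        "median_off_centre": abs((med - q1) - (q3 - med)) / iqr > tol,
        "iqr_to_sd_ratio": iqr / sd,  # roughly 1.35 for normal data
    }

rng = random.Random(42)
skewed = [rng.expovariate(1.0) for _ in range(5000)]  # right-skewed sample
flags = skew_flags(skewed)
print(flags["mean_vs_median"], flags["median_off_centre"])  # True True
```

For an exponential sample both flags fire, whereas a symmetric sample leaves them off; either way, such screening complements rather than replaces plotting the data.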

A particular situation to note is when the number of groups is three or more and the group variances differ to a large degree. In such a situation it may be preferable to treat the data as nonnormal, even if each group's data were generated from a normal distribution \cite{boneau1960effects}, particularly when the group sizes are unequal or the sample sizes are small. When there are only two groups, this is addressed using Welch's two-sample t-test, which is generally both efficient and robust under unequal variances \cite{welch1947generalization}. A similar test exists for one-way ANOVA \cite{weerahandi1995anova}, but it is not currently implemented.
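For reference, Welch's statistic and its Welch–Satterthwaite degrees of freedom are straightforward to compute. A standard-library sketch (the `welch_t` helper and the sample data are illustrative only; scipy's `ttest_ind` with `equal_var=False` runs the full test):

```python
import math
import statistics

def welch_t(x, y):
    """Welch's two-sample t statistic and its Welch-Satterthwaite degrees of
    freedom; does not assume the two groups share a common variance."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [4.2, 3.1, 5.0, 3.8, 2.9, 4.5, 3.6]
t, df = welch_t(group_a, group_b)
print(round(t, 2), round(df, 1))  # 4.02 7.1
```

Note that the degrees of freedom (about 7.1 here) fall below the pooled-variance value of n1 + n2 - 2 = 11, which is how Welch's test pays for not assuming equal variances.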

Thus far we have suggested varying the estimation and hypothesis testing techniques when a normality assumption is not appropriate. Alternatives exist which may be more practical for your situation. In many circumstances, transforming the variable can reduce the influence of asymmetry or other features of the distribution. Monotone transformations (e.g., the logarithm or square root for strictly positive data) should have little impact on any variable included in the nonnormal argument, as rank-based methods are typically invariant to this class of transformation.
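This invariance is easy to verify numerically: the median commutes with any strictly increasing transformation such as the logarithm, while the mean does not. A small illustrative check (not part of tableone; the data are made up):

```python
import math
import statistics

data = [0.5, 1.2, 2.0, 4.5, 9.3, 20.1, 44.7]  # strictly positive, right-skewed

# The median commutes with any strictly increasing transformation, so
# rank-based summaries are unaffected by, e.g., a log transform ...
log_median = statistics.median(math.log(x) for x in data)
assert math.isclose(log_median, math.log(statistics.median(data)))

# ... whereas the mean does not commute (Jensen's inequality): the mean of
# the logs is strictly below the log of the mean for non-constant data.
log_mean = statistics.mean(math.log(x) for x in data)
print(log_mean < math.log(statistics.mean(data)))  # True
```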

It should be noted that while we have tried to use best practices, automation of even basic statistical tasks can be perilous if done without supervision, and we encourage users to use TableOne alongside other methods of descriptive statistics and, in particular, data visualization to ensure appropriate handling of their data.

@book{lehmann1975nonparametrics, title={Nonparametrics: statistical methods based on ranks}, author={Lehmann, Erich Leo and D'Abrera, Howard JM}, year={1975}, publisher={Holden-Day} }

@article{conover1981rank, title={Rank transformations as a bridge between parametric and nonparametric statistics}, author={Conover, William J and Iman, Ronald L}, journal={The American Statistician}, volume={35}, number={3}, pages={124--129}, year={1981}, publisher={Taylor \& Francis Group} }

@article{zimmerman1987comparative, title={Comparative power of Student t test and Mann-Whitney U test for unequal sample sizes and variances}, author={Zimmerman, Donald W}, journal={The Journal of Experimental Education}, volume={55}, number={3}, pages={171--174}, year={1987}, publisher={Taylor \& Francis} }

@article{hodges1956efficiency, title={The efficiency of some nonparametric competitors of the t-test}, author={Hodges Jr, Joseph L and Lehmann, Erich L}, journal={The Annals of Mathematical Statistics}, pages={324--335}, year={1956}, publisher={JSTOR} }

@article{boneau1960effects, title={The effects of violations of assumptions underlying the t test.}, author={Boneau, C Alan}, journal={Psychological bulletin}, volume={57}, number={1}, pages={49}, year={1960}, publisher={American Psychological Association} }

@article{welch1947generalization, title={The generalization of `Student's' problem when several different population variances are involved}, author={Welch, Bernard L}, journal={Biometrika}, volume={34}, number={1/2}, pages={28--35}, year={1947}, publisher={JSTOR} }

@article{weerahandi1995anova, title={ANOVA under unequal error variances}, author={Weerahandi, Samaradasa}, journal={Biometrics}, pages={589--599}, year={1995}, publisher={JSTOR} }

@article{altman1996detecting, title={Detecting skewness from summary information}, author={Altman, Douglas G and Bland, J Martin}, journal={British Medical Journal}, volume={313}, number={7066}, pages={1200--1201}, year={1996}, publisher={BMJ Publishing Group Ltd.} }

@article{razali2011power, title={Power comparisons of {Shapiro-Wilk}, {Kolmogorov-Smirnov}, {Lilliefors} and {Anderson-Darling} tests}, author={Razali, Nornadiah Mohd and Wah, Yap Bee and others}, journal={Journal of Statistical Modeling and Analytics}, volume={2}, number={1}, pages={21--33}, year={2011} }

tompollard commented 6 years ago
  1. I have added the following note to the README:

A note for users of tableone

While we have tried to use best practices in creating this package, automation of even basic statistical tasks can be unsound if done without supervision. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling.

It is beyond the scope of our documentation to provide detailed guidance on summary statistics, but as a primer we provide some considerations for choosing parameters when creating a summary table at: http://tableone.readthedocs.io/en/latest/bestpractice.html.

Guidance should be sought from a statistician when using tableone for a research study, especially prior to submitting the study for publication.

  2. I have also added a similar note to the demo notebook: https://github.com/tompollard/tableone/blob/master/tableone.ipynb

  3. I have also added a best practice section to the documentation: http://tableone.readthedocs.io/en/latest/bestpractice.html

tompollard commented 6 years ago

Response to feedback

We understand your concern that the package has potential for misuse if applied indiscriminately and have taken several additional steps to help address this issue. Firstly, we have added several paragraphs to the document (shown in the tracked changes) to make it very clear that the package should not be used in isolation. These changes are best summarised by the concluding sentence of the paper:

“It should be noted that while we have tried to follow best practices, automation of even basic statistical tasks can be unsound if done without supervision. We therefore suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.”

Alternatives to the Bonferroni correction are now explicitly discussed in the paper and documentation, with versions of the following paragraphs. In addition, the package now supports several alternatives to the Bonferroni correction, allowing the user to apply any adjustment method implemented in the statsmodels multitest module.

“Bonferroni's correction addresses the problem of multiple comparisons in a simple way by dividing the prespecified significance level (Type I error rate, $\alpha$) by the number of hypothesis tests conducted. This correction is known to overcorrect, effectively reducing the statistical power of the tests, particularly when the number of hypotheses is large or when the tests are positively correlated. There are many widely used alternatives that may be more suitable, including:

Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B. 57 (1): 289–300.

Holm, S. (1979). "A simple sequentially rejective multiple test procedure". Scandinavian Journal of Statistics. 6 (2): 65–70.

Šidák, Z. K. (1967). "Rectangular Confidence Regions for the Means of Multivariate Normal Distributions". Journal of the American Statistical Association. 62 (318): 626–633.

Please consider investigating one of these alternatives if you think you may be in a situation where you would be adversely affected by the conservative nature of the Bonferroni correction.”

We have introduced a “Best Practice” section to the online documentation which is intended to act as a primer for new users of the package. The best practice content begins by emphasising that guidance should be sought "from a statistician when using tableone for a research study, especially prior to submitting the study for publication". We also highlight the importance of visualizing data:

“Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component of data analysis pipelines. Visualisation is often the only way to detect problematic variables in many real-life scenarios. Some example plots are provided in the tableone notebook.”

The demonstration notebook also emphasises the points discussed above, and includes sample code for creating kernel density estimates and boxplots.

Taken together, we hope that these steps help to allay your concerns. We agree that continual care and community feedback will be needed to prevent inappropriate use of the package. Overall, as discussed previously, we also believe that offering an open source tool such as tableone can have a positive role in promoting good practice.