jasp-stats / jasp-issues

This repository is solely meant for reporting of bugs, feature requests and other issues in JASP.

Unambiguous notations/notes/annotations in Frequencies - Contingency Tables analysis #946

Open PeterKlaren opened 4 years ago

PeterKlaren commented 4 years ago
* Enhancement: Replace "Log Odds Ratio" with "Ln Odds Ratios", clarify "Notes", complete "Show info for this analysis."
* Purpose: To obtain unambiguous and complete notations.
* Use-case: In particular novices (1st year non-mathematically inclined bio/med/soc/psy students) could do with clear annotations and help functions, I think.

**Is your feature request related to a problem? Please describe.**

**Describe the solution you'd like**

**Describe alternatives you've considered**

**Additional context**

image image

PeterKlaren commented 4 years ago

In reproducing JASP's output in R (fisher.test in the stats package) I now see that the odds ratio for Fisher's exact test is calculated using the conditional Maximum Likelihood Estimate (MLE) rather than the unconditional MLE that appears to be the conventional one. (This is from the ?fisher.test help page, and the explanation is a bit esoteric to me.) The values of the two odds ratios hardly differ. Why are two odds ratios presented, and when does it matter to choose one over the other?
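To make the difference between the two estimates concrete, here is a minimal Python sketch. It is not JASP's or R's actual code; the function names and the root-finding bracket are assumptions made for illustration. The unconditional MLE is the cross-product ratio of the four cells, while the conditional MLE is the odds ratio at which the expected value of the top-left cell, under Fisher's noncentral hypergeometric distribution with all margins fixed, equals the observed count.

```python
from math import comb


def sample_odds_ratio(a, b, c, d):
    """Unconditional MLE: the cross-product ratio (a*d)/(b*c).
    Assumes b and c are non-zero."""
    return (a * d) / (b * c)


def conditional_mle_odds_ratio(a, b, c, d, tol=1e-10):
    """Conditional MLE: the odds ratio psi for which the mean of Fisher's
    noncentral hypergeometric distribution (all margins fixed) equals a.
    Intended for small tables; large counts may overflow floats."""
    r1, r2, c1 = a + b, c + d, a + c
    x_min, x_max = max(0, c1 - r2), min(r1, c1)
    if a == x_min:
        return 0.0
    if a == x_max:
        return float("inf")
    support = list(range(x_min, x_max + 1))
    weights = [comb(r1, x) * comb(r2, c1 - x) for x in support]

    def mean(psi):
        terms = [w * psi ** x for w, x in zip(weights, support)]
        return sum(x * t for x, t in zip(support, terms)) / sum(terms)

    lo, hi = 1e-6, 1e6  # assumed bracket for the root
    while hi / lo > 1 + tol:
        mid = (lo * hi) ** 0.5  # bisect on the log scale
        if mean(mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


table = (3, 1, 1, 3)  # cells a, b, c, d of a small 2x2 table
print(sample_odds_ratio(*table))
print(conditional_mle_odds_ratio(*table))
```

For this small table the sample odds ratio is 9 and the conditional estimate is roughly 6.41, so the two can differ noticeably when counts are low. With larger counts they tend to agree closely.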

tomtomme commented 7 months ago

@PeterKlaren & @vandenman I cannot help with points 1, 2 and 4. But I had an intimate conversation with the CoPilot AI about point 3. I know, you might want to stop reading right now, when I mention CoPilot AI. But read till the end. It is quite good what it produced for us. Not perfect, but good as a starting point. The whole conversation is here:

https://sl.bing.net/i6ntIpzNesC

It can be used as a template for other missing help files, if you and the team agree with the general approach and quality of that documentation the AI gives us.

Below is the AI-written help file for JASP's "Contingency Table" module. There is one hallucination in there about "descriptive plots". I marked it. Those plots do not yet exist in JASP. They may stem from the fact that I used CoPilot's creative mode and that they exist as feature requests. But in general the AI text sounds quite correct and precise. In the above link you can even find a more detailed and better formatted version from the AI that also explains all formulas behind the outputs. But I suspect some hallucinations in there, since I do not think that the AI can really know exactly which formula was used, especially when there is more than one approach to computing a statistical measure.

Now the AI-generated text:

Contingency Tables

The contingency tables module allows the user to test whether two categorical variables are independent or associated. The analysis returns a chi-squared test and other measures of association for each pair of variables in a contingency table. The module can also handle variables with more than two levels.

Assumptions
• The variables should be categorical (nominal or ordinal).
• Observations should be independent.
• Expected frequencies should be at least 5 for most cells in the table.

Input
• Rows: The variable that defines the rows of the contingency table.
• Columns: The variable that defines the columns of the contingency table.
• Counts: The variable that contains the frequency counts for each cell in the contingency table. If this is not specified, each observation is assumed to have a count of 1.
• Layer: The variable that defines additional layers of the contingency table. This can be used to create a series of 2x2 tables for each level of the layer variable.

Statistics
• Chi-Squared Tests: Performs a chi-squared test of independence for the contingency table. The output includes the chi-squared statistic, the degrees of freedom, the p-value, and the Vovk-Sellke maximum p-ratio. Optionally, a continuity correction can be applied for 2x2 tables, and a likelihood ratio test can be reported as well.
• Log Odds Ratio: Computes the log odds ratio and its confidence interval for 2x2 tables. The output includes the log odds ratio, the lower and upper bounds of the confidence interval, and the p-value from Fisher's exact test.
• Nominal: Computes measures of association for nominal variables. The output includes the contingency coefficient, the phi-coefficient, Cramér's V, and the lambda coefficients (rows, columns, and symmetric).
• Ordinal: Computes measures of association for ordinal variables. The output includes the gamma coefficient and its confidence interval, and Kendall's tau-b coefficient with its p-value and VS-MPR.

> Plots
> • Descriptive Plots: Displays a mosaic plot of the contingency table, showing the observed and expected frequencies, the standardized residuals, and the cell percentages.
> • Confidence Interval: Specifies the coverage of the confidence intervals for the log odds ratio and the gamma coefficient. The default value is 95%.

Output

Contingency Tables
• The table shows the counts of observations summarized in a contingency table. The output includes the observed count, the expected count, the percentage within row, the percentage within column, the percentage of total, the unstandardized residual, the Pearson residual, and the standardized residual for each cell in the table. The output also includes the row, column, and grand totals.
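As an illustration of the cell-level quantities listed above, the following Python sketch computes expected counts and the three kinds of residuals from a table of observed counts. It uses the usual textbook definitions; whether JASP's "standardized residual" matches the adjusted residual shown here is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def cell_statistics(table):
    """Return, for every cell, the expected count under independence and
    the unstandardized, Pearson, and standardized (adjusted) residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        result_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            raw = observed - expected
            pearson = raw / sqrt(expected)
            # Assumed definition: adjusted residual with margin correction.
            adjusted = raw / sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            result_row.append({
                "expected": expected,
                "unstandardized": raw,
                "pearson": pearson,
                "standardized": adjusted,
            })
        result.append(result_row)
    return result


stats = cell_statistics([[10, 20], [30, 40]])
print(stats[0][0])
```

The expected counts are also what the "at least 5 for most cells" assumption refers to, so the same helper can be used to check that rule of thumb.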

Chi-Squared Tests
• Χ²: The chi-squared statistic for testing the independence of the variables.
• df: The degrees of freedom for the chi-squared test.
• p: The p-value for the chi-squared test.
• VS-MPR: The Vovk-Sellke maximum p-ratio, which indicates the maximum possible odds in favor of the alternative hypothesis over the null hypothesis, based on the p-value.
• Χ² continuity correction: The chi-squared statistic with a continuity correction, which reduces the discrepancy between the discrete and continuous distributions for 2x2 tables.
• Likelihood ratio: The likelihood ratio statistic, which is based on the ratio of the maximum likelihoods of the null and alternative hypotheses.
• N: The total number of observations in the contingency table.
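The chi-squared statistic, its p-value, and the VS-MPR can be sketched in a few lines of Python for the 2x2 case. This is an illustrative sketch, not JASP's implementation: the function names are made up, the p-value shortcut only holds for 1 degree of freedom, and the continuity correction shown is Yates' correction (assumed to be the one meant above). The VS-MPR formula is 1 / (-e · p · ln p) for p < 1/e, and 1 otherwise.

```python
from math import e, erfc, log, sqrt


def chi_squared_2x2(table, continuity=False):
    """Pearson chi-squared statistic for a 2x2 table of counts,
    optionally with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if continuity:
                diff = max(0.0, diff - 0.5)
            stat += diff ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return erfc(sqrt(stat / 2))


def vs_mpr(p):
    """Vovk-Sellke maximum p-ratio for a p-value."""
    if p <= 0:
        return float("inf")
    if p >= 1 / e:
        return 1.0
    return 1 / (-e * p * log(p))


stat = chi_squared_2x2([[10, 20], [30, 40]])
p = p_value_df1(stat)
print(stat, p, vs_mpr(p))
print(vs_mpr(0.05))
```

A p-value of 0.05 gives a VS-MPR of about 2.46, which means the data are at most about 2.5 times more likely under the best-case alternative than under the null.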

Log Odds Ratio
• Log Odds Ratio: The natural logarithm of the odds ratio, which is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group.
• Lower: The lower bound of the confidence interval for the log odds ratio.
• Upper: The upper bound of the confidence interval for the log odds ratio.
• p: The p-value from Fisher's exact test, which is based on the hypergeometric distribution of the cell counts for 2x2 tables.
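For reference, the log odds ratio and a large-sample confidence interval can be sketched as follows. The interval here is the standard Wald interval, ln(OR) ± z · sqrt(1/a + 1/b + 1/c + 1/d); that JASP uses exactly this interval is an assumption, and the function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def log_odds_ratio_ci(a, b, c, d, level=0.95):
    """Natural-log odds ratio of a 2x2 table [[a, b], [c, d]] with a
    Wald confidence interval. Assumes all four cells are non-zero."""
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return log_or, log_or - z * se, log_or + z * se


estimate, lower, upper = log_odds_ratio_ci(10, 20, 30, 40)
print(estimate, lower, upper)
```

An interval that contains 0 on the log scale corresponds to an odds ratio interval that contains 1, that is, no clear association.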

Nominal
• Contingency coefficient: A measure of association for nominal variables, starting at 0 (no association) and approaching, but never reaching, 1. It is based on the chi-squared statistic and the sample size.
• Phi-coefficient: A measure of association for nominal variables, ranging from -1 (negative association) to 1 (positive association). It is equivalent to the Pearson correlation coefficient for dichotomous variables.
• Cramér's V: A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is a normalized version of the phi-coefficient that takes into account the dimensions of the contingency table (the smaller of rows minus 1 and columns minus 1).
• Lambda (rows): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is the proportional reduction in error when predicting the row variable from the column variable, compared to the marginal distribution of the row variable.
• Lambda (columns): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is the proportional reduction in error when predicting the column variable from the row variable, compared to the marginal distribution of the column variable.
• Lambda (symmetric): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It combines the two directional lambdas by pooling their numerators and denominators, so it is in general not their simple average.
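The nominal measures can be sketched from their textbook definitions as below. This is an illustration under stated assumptions, not JASP's code: the function names are hypothetical, the lambdas follow the standard Goodman-Kruskal definitions (the symmetric version pools the numerators and denominators of the two directional lambdas), and both variables are assumed to have at least two non-empty levels.

```python
from math import sqrt


def chi_squared(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def nominal_measures(table):
    """Contingency coefficient, Cramer's V, and Goodman-Kruskal lambdas."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = chi_squared(table)
    k = min(len(rows), len(cols)) - 1
    # Correct predictions when guessing the modal row within each column,
    # and the modal column within each row.
    modal_in_cols = sum(max(c) for c in zip(*table))
    modal_in_rows = sum(max(r) for r in table)
    max_row, max_col = max(rows), max(cols)
    return {
        "contingency_coefficient": sqrt(chi2 / (chi2 + n)),
        "cramers_v": sqrt(chi2 / (n * k)),
        "lambda_rows": (modal_in_cols - max_row) / (n - max_row),
        "lambda_columns": (modal_in_rows - max_col) / (n - max_col),
        "lambda_symmetric": (modal_in_cols + modal_in_rows - max_row - max_col)
        / (2 * n - max_row - max_col),
    }


print(nominal_measures([[20, 5], [10, 65]]))
```

For a 2x2 table Cramér's V equals the absolute value of the phi-coefficient, since the smaller table dimension minus 1 is 1.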

Ordinal
• Gamma: A measure of association for ordinal variables, ranging from -1 (negative association) to 1 (positive association). It is based on the difference between the number of concordant and discordant pairs of observations, divided by the sum of concordant and discordant pairs (tied pairs are ignored).
• Standard Error: The standard error of the gamma coefficient, which is used to compute the confidence interval.
• Lower: The lower bound of the confidence interval for the gamma coefficient.
• Upper: The upper bound of the confidence interval for the gamma coefficient.
• Kendall's Tau-b: A measure of association for ordinal variables, ranging from -1 (negative association) to 1 (positive association). It is based on the difference between the number of concordant and discordant pairs of observations, divided by the square root of the product of the number of pairs not tied on the row variable and the number of pairs not tied on the column variable.
• Z: The test statistic for testing the null hypothesis that Kendall's tau-b is zero.
• p: The p-value for the test of Kendall's tau-b.
• VS-MPR: The Vovk-Sellke maximum p-ratio for the test of Kendall's tau-b.
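Both ordinal measures can be computed directly from the counts of concordant and discordant pairs. The sketch below uses the standard definitions (gamma divides by concordant plus discordant pairs, tau-b corrects for ties on each variable); the function names are hypothetical and the standard error and test statistic are left out.

```python
from math import sqrt


def concordant_discordant(table):
    """Count concordant and discordant pairs in an r x c table whose
    rows and columns are both in ascending category order."""
    conc = disc = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        conc += table[i][j] * table[k][m]
                    elif m < j:
                        disc += table[i][j] * table[k][m]
    return conc, disc


def gamma_and_tau_b(table):
    """Goodman-Kruskal gamma and Kendall's tau-b for an ordered table."""
    conc, disc = concordant_discordant(table)
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    total_pairs = n * (n - 1) / 2
    tied_on_rows = sum(r * (r - 1) / 2 for r in rows)
    tied_on_cols = sum(c * (c - 1) / 2 for c in cols)
    gamma = (conc - disc) / (conc + disc)
    tau_b = (conc - disc) / sqrt(
        (total_pairs - tied_on_rows) * (total_pairs - tied_on_cols)
    )
    return gamma, tau_b


print(gamma_and_tau_b([[20, 5], [10, 65]]))
```

Because gamma ignores ties entirely, it is never smaller in absolute value than tau-b for the same table.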

> Descriptive Plots
> • The plot shows a mosaic plot of the contingency table, with the size of the tiles proportional to the observed frequencies, and the color of the tiles indicating the standardized residuals. The plot also displays the expected frequencies, the standardized residuals, and the cell percentages for each cell in the table.

vandenman commented 7 months ago

Some remarks:

> "Log Odds Ratio" with "Ln Odds Ratios"

In statistics, it is pretty common (although admittedly not entirely unambiguous) to use log to indicate the natural logarithm (e.g., log-likelihood, but also log odds, see e.g., Wikipedia).

> "Notes: For all tests, the alternative hypothesis specifies that group [...] is greater than [...]." The groups compared are the names of the rows, they don't designate a contingency table's cell name (the value of which is a bi-variate variable or count).

This indeed appears insufficient.

> I think users can now expect all help screens to be complete?

I'll get to that. @tomtomme thanks! That's probably a nice starting point.