[Feature Request]: Add / complete help files

Still valid for 0.19.2 beta

Meta-issue to collect modules with missing or incomplete help files from all related issues. A "closed" mark on a reference to an old issue does not mean that the documentation has been fixed. Fixed documentation is indicated by a ticked box below.

A first attempt to generate docs via Claude/Copilot is documented here: https://github.com/jasp-stats/jasp-issues/issues/946. It involved a step-by-step procedure that

1. trains the AI on an existing help file, and
2. shows the AI the complete output of a sample analysis from the data library.

The full template is below. If a complete module has no existing help file (1), or if there is no data library .jasp file (2), this method might not give results as good as those for "contingency tables".

Completely missing help files

[ ] Regression: Bayesian Logistic Regression
[ ] Frequencies: Contingency Tables & Log-Linear Regression #946
[ ] Factor: All
[ ] BSTS
[ ] Cochrane Meta-Analyses: All
[x] Distributions (most submodules)
[ ] JAGS
[ ] Learn Bayes: All
[ ] Learn Stats: except the cloned distribution modules
[ ] Bayesian Network Analysis
[x] All Classical Meta-Analysis modules
[ ] Meta-Analysis: Bayesian Binomial; Prediction Model Performance
[ ] Prophet
[ ] Quality Control: Probability of Detection
[ ] Reliability: Unidimensional Reliability & Bland-Altman Plots & Bayesian Unidimensional Reliability
[x] SEM & PLS
[ ] SEM Latent Growth
[x] Survival

Lacking help files

[ ] t-Test, Wilcoxon: several formulas are available regarding tie handling and asymptotic vs. exact tests (coin in R); document which formula is used #452
[ ] Bayesian t-Tests & BANOVAs, section "Plots", paragraph "Prior and posterior", add: "The effect size delta is an estimate of the effect size distribution in the population. It is not a point estimate." from #1812
[ ] BANOVA: add "descriptive plots are based on Jeffreys's parameterisation invariant priors used for estimation." from #195
[ ] Correlation #1364
[ ] Regression: Add note that polynomial / non-linear regression are available in "Visual Modeling" module
[ ] Bayesian Regression: Explain why beta-binomial is default, see #1224
[ ] Mediation analysis: no "output" section, and the explanations in the input section are too short #1152
[ ] Machine Learning: Random Forest Classification: Unclear what variable importance means. Is it permutation importance from Breiman (2001) 'Random Forests'? (from #747)

Template to train AI

I have quite an important task. I need you to write documentation on JASP's "contingency tables" module that resides in the Frequencies section. Almost every JASP module has a complete help file that explains all available options briefly and precisely, but not this one. To ensure that the documentation you write is correct, you will follow a step-by-step procedure.

First step: You will state whether you understand the task in general or whether you need additional information. Only then will we go to the second step.
Second step: I will give you an example of a complete JASP help file in the next prompt. You will read it and confirm when you are done. This should ensure that you follow the same writing style. After you have confirmed reading it, I will give you the information for step 3.
Third step: I will give you all options from JASP's "contingency tables" module through a result file in HTML format. Read the result. It will contain a numerical example. It is important, however, that you write the documentation without that example in mind, so that it generalizes to other examples. You will confirm that you have read the HTML-formatted results, or ask any questions, before we proceed to step 4.
Fourth step: You will now write the documentation for said module.

Understood?

You understood correctly. Please note that I am writing in English and that the documentation should also be in English. Here is the example. It is the help file of the module "...":

Pick an example help file whose style fits the target module. E.g. give it the help file of PCA to create the help file of EDA – which is not possible yet, since PCA, EDA and CDA are all missing.

Great. We will now proceed with step three. The following example analysis is from JASP's data library. It is the "Dancing Cats: Chi-squared Test of Independence". You can find a PDF here: https://jasp-stats.org/wp-content/uploads/2020/05/The_JASP_Data_Library_1st_Edition.pdf

That PDF documents all examples from JASP's data library. It may also help with writing the documentation. Now follows the HTML-formatted file of the example. Note: I have ticked all available analysis options in JASP, so the results will be more detailed than in the PDF referenced above. The documentation needs to cover all options, not just those referenced in the PDF:

...input the results here ...

Great. I confirm that you shall proceed. Thanks.

Nice. Can you now add

a) appropriate line breaks at the bullet points, headings and sub-headings

This step is only needed if formatting is missing.

b) Do not include the sections about ... in the revised version, since those are not implemented yet. Stick to the facts and features implemented today.

This step might lead to many more hallucinations and might be too detailed overall.

c) more detail on the calculations and formulas used for the different statistical measures. But maybe only for those where more than one formula variant exists, so that the reader knows which variant was used. You may find hints for the actual formulas on GitHub in JASP's source code. Please do not add those calculation details if you are unsure about them.

@tomtomme, thanks for taking the time to create this issue. If possible (and applicable), please upload to the issue website (https://github.com/jasp-stats/jasp-issues/issues/2529) a screenshot showcasing the problem, and/or a compressed (zipped) .jasp file or the data file that causes the issue. If you would prefer not to make your data publicly available, you can send your file(s) directly to us, issues@jasp-stats.org

We have a project to generate the help files from the QML info properties: each option, section or group should have an info property with some translatable text. This will automatically generate the help md file, and also the roxygen documentation for the R syntax.
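To make the idea concrete, here is a minimal Python sketch of how a help md file could be rendered once every option, section or group carries an info text. This is not the actual JASP generator: the nested `form` dictionary, its keys (`title`, `info`, `children`) and the function `render_help` are assumptions made up for illustration, and the real tool would read these values from the QML files.

```python
def render_help(node, depth=1):
    """Render a nested {title, info, children} dict as markdown lines.

    Hypothetical structure: nodes with children become headings,
    leaf nodes (single options) become bullet points.
    """
    lines = []
    if node.get("children"):
        lines.append("#" * depth + " " + node["title"])
        if node.get("info"):
            lines.append(node["info"])
        for child in node["children"]:
            lines.extend(render_help(child, depth + 1))
    else:
        lines.append(f"- {node['title']}: {node.get('info', '')}")
    return lines


# Illustrative input; the info texts are taken from the draft quoted in this thread
form = {
    "title": "Contingency Tables",
    "children": [
        {"title": "Rows",
         "info": "The variable that defines the rows of the contingency table."},
        {"title": "Statistics",
         "children": [
             {"title": "Log Odds Ratio",
              "info": "Computes the log odds ratio and its confidence interval for 2x2 tables."},
         ]},
    ],
}

print("\n".join(render_help(form)))
```

The same traversal could in principle emit roxygen comments instead of markdown, since both outputs would be derived from the one info text per option.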

@boutinb

Sounds like a major effort to implement. How about implementing AI-generated docs in the meantime until we have better ones? How does the quality of your doc-generator compare to the AI-generator? For reference, the following is the AI-generated documentation for the "contingency tables" module generated in minutes:

Contingency Tables
The contingency tables module allows the user to test whether two categorical variables are independent or associated. The analysis returns a chi-squared test and other measures of association for each pair of variables in a contingency table. The module can also handle variables with more than two levels.

Assumptions
• The variables should be categorical (nominal or ordinal).
• Observations should be independent.
• Expected frequencies should be at least 5 for most cells in the table.

Input
• Rows: The variable that defines the rows of the contingency table.
• Columns: The variable that defines the columns of the contingency table.
• Counts: The variable that contains the frequency counts for each cell in the contingency table. If this is not specified, each observation is assumed to have a count of 1.
• Layer: The variable that defines additional layers of the contingency table. This can be used to create a series of 2x2 tables for each level of the layer variable.

Statistics
• Chi-Squared Tests: Performs a chi-squared test of independence for the contingency table. The output includes the chi-squared statistic, the degrees of freedom, the p-value, and the Vovk-Sellke maximum p-ratio. Optionally, a continuity correction can be applied for 2x2 tables, and a likelihood ratio test can be reported as well.
• Log Odds Ratio: Computes the log odds ratio and its confidence interval for 2x2 tables. The output includes the log odds ratio, the lower and upper bounds of the confidence interval, and the p-value from Fisher's exact test.
• Nominal: Computes measures of association for nominal variables. The output includes the contingency coefficient, the phi-coefficient, the Cramer's V, and the lambda coefficients (rows, columns, and symmetric).
• Ordinal: Computes measures of association for ordinal variables. The output includes the gamma coefficient and its confidence interval, and the Kendall's tau-b coefficient, its p-value and the VS-MPR.

Output
Contingency Tables
• The table shows the counts of observations summarized in a contingency table. The output includes the observed count, the expected count, the percentage within row, the percentage within column, the percentage of total, the unstandardized residual, the Pearson residual, and the standardized residual for each cell in the table. The output also includes the row, column, and grand totals.
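For fact-checking the cell statistics listed above, here is a minimal Python sketch of the textbook definitions of the expected count, the unstandardized residual and the Pearson residual. It is not JASP's R code; the function name `cell_summaries` and the 2x2 counts are made up for illustration, and the standardized residual is left out because its exact variant in JASP is not confirmed here.

```python
from math import sqrt


def cell_summaries(table):
    """Return expected counts, unstandardized and Pearson residuals per cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected, raw, pearson = [], [], []
    for i, row in enumerate(table):
        # Expected count under independence: row total * column total / N
        exp_row = [row_totals[i] * ct / n for ct in col_totals]
        # Unstandardized residual: observed minus expected
        raw_row = [obs - exp for obs, exp in zip(row, exp_row)]
        expected.append(exp_row)
        raw.append(raw_row)
        # Pearson residual: (observed - expected) / sqrt(expected)
        pearson.append([r / sqrt(exp) for r, exp in zip(raw_row, exp_row)])
    return expected, raw, pearson


# Made-up counts, not the "Dancing Cats" data
expected, raw, pearson = cell_summaries([[10, 20], [30, 40]])
print(expected)  # [[12.0, 18.0], [28.0, 42.0]]
```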

Chi-Squared Tests
• Χ²: The chi-squared statistic for testing the independence of the variables.
• df: The degrees of freedom for the chi-squared test.
• p: The p-value for the chi-squared test.
• VS-MPR: The Vovk-Sellke maximum p-ratio, which indicates the maximum possible odds in favor of the alternative hypothesis over the null hypothesis, based on the p-value.
• Χ² continuity correction: The chi-squared statistic with a continuity correction, which reduces the discrepancy between the discrete and continuous distributions for 2x2 tables.
• Likelihood ratio: The likelihood ratio statistic, which is based on the ratio of the maximum likelihoods of the null and alternative hypotheses.
• N: The total number of observations in the contingency table.
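As a cross-check for the quantities in this table, the following Python sketch computes the Pearson chi-squared statistic, the Yates-corrected variant, the likelihood ratio statistic and the VS-MPR from their standard formulas. It is an illustration under assumptions, not JASP's implementation: all function names are made up, the p-value helper only covers 1 degree of freedom (the 2x2 case), and whether JASP uses exactly these variants would still need to be verified in the source code.

```python
import math


def expected_counts(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_squared(table, continuity=False):
    """Pearson chi-squared; continuity=True applies Yates' correction (2x2 tables)."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            if continuity:
                diff = max(diff - 0.5, 0.0)  # never correct past zero
            total += diff ** 2 / exp
    return total


def likelihood_ratio(table):
    """G statistic: 2 * sum(O * ln(O / E)); empty cells contribute 0."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for obs, exp in zip(obs_row, exp_row):
            if obs > 0:
                total += obs * math.log(obs / exp)
    return 2 * total


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


def vs_mpr(p):
    """Vovk-Sellke maximum p-ratio: 1 / (-e * p * ln p) for p < 1/e, else 1."""
    if p >= 1 / math.e:
        return 1.0
    if p <= 0:
        return math.inf
    return 1 / (-math.e * p * math.log(p))


table = [[10, 20], [30, 40]]  # made-up counts
stat = chi_squared(table)
print(stat, chi_squared(table, continuity=True), likelihood_ratio(table))
print(p_value_df1(stat), vs_mpr(p_value_df1(stat)))
```

For tables larger than 2x2 the p-value needs the chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom, which the standard library does not provide.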

Log Odds Ratio
• Log Odds Ratio: The natural logarithm of the odds ratio, which is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group.
• Lower: The lower bound of the confidence interval for the log odds ratio.
• Upper: The upper bound of the confidence interval for the log odds ratio.
• p: The p-value from Fisher's exact test, which is based on the hypergeometric distribution of the cell counts for 2x2 tables.
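This is one of the places where more than one formula variant exists, so a sketch may help whoever fact-checks the draft. The Python code below computes the sample log odds ratio of a 2x2 table with a Wald confidence interval. That choice is an assumption: JASP might report a different estimator or interval (for example the conditional estimate that accompanies Fisher's exact test), and the function name `log_odds_ratio` is made up for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def log_odds_ratio(table, level=0.95):
    """Sample log odds ratio of a 2x2 table with a Wald confidence interval.

    Assumes all four cells are non-zero; no zero-cell correction is applied.
    """
    (a, b), (c, d) = table
    estimate = log((a * d) / (b * c))
    # Standard error of the log odds ratio (Wald approximation)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * se, estimate + z * se


# Made-up counts
estimate, lower, upper = log_odds_ratio([[10, 20], [30, 40]])
print(estimate, lower, upper)
```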

Nominal
• Contingency coefficient: A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is based on the chi-squared statistic and the sample size.
• Phi-coefficient: A measure of association for nominal variables, ranging from -1 (negative association) to 1 (positive association). It is equivalent to the Pearson correlation coefficient for dichotomous variables.
• Cramer's V: A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is a normalized version of the phi-coefficient that takes into account the degrees of freedom of the contingency table.
• Lambda (rows): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is the proportional reduction in error when predicting the row variable from the column variable, compared to the marginal distribution of the row variable.
• Lambda (columns): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is the proportional reduction in error when predicting the column variable from the row variable, compared to the marginal distribution of the column variable.
• Lambda (symmetric): A measure of association for nominal variables, ranging from 0 (no association) to 1 (complete association). It is the average of the lambda coefficients for the rows and columns.
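To help verify the chi-squared-based measures in this table, here is a small Python sketch of the textbook formulas for the contingency coefficient, the phi-coefficient and Cramér's V. It is not taken from JASP's source: the function name `nominal_measures` is made up, the sign given to phi in the 2x2 case is an assumption, and the lambda coefficients are omitted.

```python
from math import sqrt


def nominal_measures(table):
    """Contingency coefficient, phi and Cramer's V from the Pearson chi-squared."""
    rows, cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            exp = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - exp) ** 2 / exp

    contingency = sqrt(chi2 / (chi2 + n))
    cramers_v = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    phi = sqrt(chi2 / n)
    if rows == 2 and cols == 2:
        # Assumption: in a 2x2 table phi takes the sign of ad - bc
        (a, b), (c, d) = table
        if a * d - b * c < 0:
            phi = -phi
    return {"contingency": contingency, "phi": phi, "cramers_v": cramers_v}


# Made-up counts
print(nominal_measures([[10, 20], [30, 40]]))
```

In a 2x2 table Cramér's V equals the absolute value of phi, which gives a quick sanity check on any reported output.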

Ordinal
• Gamma: A measure of association for ordinal variables, ranging from -1 (negative association) to 1 (positive association). It is based on the difference between the number of concordant and discordant pairs of observations, divided by the total number of pairs.
• Standard Error: The standard error of the gamma coefficient, which is used to compute the confidence interval.
• Lower: The lower bound of the confidence interval for the gamma coefficient.
• Upper: The upper bound of the confidence interval for the gamma coefficient.
• Kendall's Tau-b: A measure of association for ordinal variables, ranging from -1 (negative association) to 1 (positive association). It is based on the difference between the number of concordant and discordant pairs of observations, divided by the square root of the product of the number of pairs not tied on the row variable and the number of pairs not tied on the column variable.
• Z: The test statistic for testing the null hypothesis that Kendall's tau-b is zero.
• p: The p-value for the test of Kendall's tau-b.
• VS-MPR: The Vovk-Sellke maximum p-ratio for the test of Kendall's tau-b.
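The pair-counting behind these two coefficients can be sketched in a few lines of Python, which may help when checking the wording of this table. The sketch uses the usual Goodman-Kruskal definition of gamma, where tied pairs are left out of the denominator, and the tau-b denominator as described in the text. It is not JASP's implementation, the function names are made up, and standard errors and tests are not covered.

```python
from math import sqrt


def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered r x c table."""
    conc = disc = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Only compare with cells in later rows, so each pair is counted once
            for later_row in table[i + 1:]:
                for k, other in enumerate(later_row):
                    if k > j:
                        conc += count * other
                    elif k < j:
                        disc += count * other
    return conc, disc


def gamma_and_tau_b(table):
    conc, disc = concordant_discordant(table)
    n = sum(map(sum, table))
    pairs = n * (n - 1) / 2
    # Pairs tied on the row variable share a row; likewise for columns
    tied_rows = sum(t * (t - 1) / 2 for t in map(sum, table))
    tied_cols = sum(t * (t - 1) / 2 for t in map(sum, zip(*table)))
    gamma = (conc - disc) / (conc + disc)
    tau_b = (conc - disc) / sqrt((pairs - tied_rows) * (pairs - tied_cols))
    return gamma, tau_b


# Made-up counts
print(gamma_and_tau_b([[10, 20], [30, 40]]))
```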

@tomtomme @boutinb yes, I think using some AI generation is the way to go here (and I understand from some colleagues that we won't be unique in this) - it will simply be unmanageable to have all documentation written from scratch. I would suggest including it in the standard module development workflow, such that the analysis creator simply has to fact-check the generated documentation.

Just a note: If there is at some point a plan on pursuing the generative-AI part, then I suggest to look at this benchmark first: https://simple-bench.com/

It is quite a hard multiple-choice benchmark where even the best AI currently reaches only half the points a normal human would score.

The ranking shows that OpenAI's models are currently lacking, except the paid o1-preview. Claude Sonnet 3.5 is the best free LLM atm. But this is in constant flux, so looking at that leaderboard next week may lead to other conclusions.
