tidymodels / planning

Documents to plan and discuss future development
MIT License

Idea - Statistical testing library #9

Open efullenkamp94 opened 4 years ago

efullenkamp94 commented 4 years ago

For the past few weeks I've been mulling over the idea of a tidy-style interface for statistical testing. The motivation came out of a few recent projects I've run at work, which led me to discover a few areas to improve. The idea for this package differs from infer's approach in that it tries to add as little as possible to learn for those coming from a non-statistics background. I apologize if this seems all over the place; in my head it grew out of a need that current packages don't meet, mixed with a few newer tidyverse-style verbs. I'm opening this issue for two reasons: first, to comply with the contribution guidelines for tidy(verse/models) packages, and second, to get input from others on ways forward. One thing I want to point out: while this may seem to duplicate infer, I believe this would be a library for non-statisticians, to whom infer's syntax might seem foreign.

The original problem: The work project that launched this was a survey that the organization I work for ran for a government agency. We broke down the survey results by four important demographic factors that we also wanted to test. We ran numerous tests, the main three being chi-squared, Shapiro-Wilk, and Kruskal-Wallis. We ran the tests on each applicable question, so results for all five groupings (the four demographic breakdowns plus the overall sample) were output from a function into a list.

Current approach: With R's base stats package, the way data must be supplied to a statistical test varies from test to test, and the data can require transformation before the test is run. With more and more data stored in a tidy manner, being able to compute statistical tests directly from tidy datasets is beneficial. I've heard from a number of coworkers (all non-statisticians) that the inconsistencies in the stats package are one of the reasons R has a steep learning curve.
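A minimal sketch of the input inconsistency described above, using the three tests mentioned in the problem statement. The `survey` data frame and its column names are invented for illustration and are not from the original project; only the base `stats` functions are real.

```r
# Hypothetical survey-style data; column names are assumptions for illustration.
set.seed(1)
survey <- data.frame(
  region = rep(c("north", "south", "west"), each = 20),
  answer = sample(c("yes", "no"), 60, replace = TRUE),
  score  = rnorm(60, mean = 50, sd = 10)
)

# chisq.test() is given a contingency table built from two columns.
chisq_res <- chisq.test(table(survey$region, survey$answer))

# shapiro.test() takes a single numeric vector.
shapiro_res <- shapiro.test(survey$score)

# kruskal.test() accepts a formula together with a data argument.
kruskal_res <- kruskal.test(score ~ region, data = survey)
```

All three calls start from the same tidy data frame, yet each needs the data pulled out or reshaped differently before the test can run.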

My approach: Create a library that follows two principles. First, be as simple as possible for non-statisticians to pick up and use quickly. Second, do this while adding as few new verbs as possible, ideally none, to the tidyverse/tidymodels universe.

Current working idea: A package tentatively called Tidy Tests.

If you read the problem statement, returning the values as a list might not have been the best way to return the data. By the end of that project, and on later ones, I started using broom heavily and writing functions to output "clean"-looking data at the end. I believe the best option would be to add some sort of cleaned output similar to what broom produces for other objects.
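As a rough sketch of what such cleaned output could look like, the helper below flattens a base R `htest` result into a one-row data frame so that several tests can be stacked. `tidy_htest()` is a hypothetical name written for this illustration, not an existing function; broom's `tidy()` already provides comparable behavior for `htest` objects.

```r
# Hypothetical helper (not part of any package): flatten an "htest" object
# into a one-row data frame.
tidy_htest <- function(x) {
  stopifnot(inherits(x, "htest"))
  data.frame(
    method    = x$method,
    statistic = unname(x$statistic),
    p_value   = x$p.value,
    stringsAsFactors = FALSE
  )
}

# Invented example data for illustration only.
set.seed(1)
scores <- rnorm(60, mean = 50, sd = 10)
groups <- factor(rep(c("a", "b", "c"), each = 20))

# Stack the results of two different tests into one data frame.
results <- rbind(
  tidy_htest(shapiro.test(scores)),
  tidy_htest(kruskal.test(scores, groups))
)
```

Each row of `results` then describes one test, which is easier to filter, join, or print than a nested list of test objects.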

juliasilge commented 4 years ago

Have you looked at the package infer, which is part of the tidymodels ecosystem? There is a lot of overlap with what you described here (enough perhaps to argue for not starting a separate, new project) and you may be interested in using that package and/or collaborating with those authors.

juliasilge commented 4 years ago

As a starting point, you may want to check out this article on tidymodels.org.