tompollard / tableone

Create "Table 1" for research papers in Python
https://pypi.python.org/pypi/tableone/
MIT License

Support custom tests #92

Closed: tompollard closed this issue 4 years ago

tompollard commented 4 years ago

We should support user-defined tests for comparing values across multiple groups (e.g. allow the test to be specified with a test = {var: my_fun} argument, where var is a variable and my_fun is a function).

tompollard commented 4 years ago

Custom tests are supported in the version on the master branch. e.g.

import numpy as np
import pandas as pd
from scipy import stats
from tableone import TableOne

# define the custom test.
# `*` allows the function to take an unknown number of arguments
def mytest(*args):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    mytest.__name__ = "This is a custom test"
    _, pval = stats.ks_2samp(*args)
    return pval
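
A side note on the label: the assignment to `mytest.__name__` sits inside the function body, so it only runs once the function has been called. Assuming the label shown in the table is read from the function's `__name__` (as the example above suggests), here is a minimal sketch of the difference between setting it inside and outside the body. Both function names are made up for illustration and the return values are placeholders, not real p-values.

```python
def label_inside(*groups):
    """Stand-in custom test that renames itself when called."""
    label_inside.__name__ = "Label set inside"
    return 1.0  # placeholder p-value


def label_outside(*groups):
    """Stand-in custom test that is renamed at definition time."""
    return 1.0  # placeholder p-value


# Rename immediately, before anything calls the function.
label_outside.__name__ = "Label set outside"

name_before_call = label_inside.__name__  # still the original name
label_inside([1, 2], [3, 4])
name_after_call = label_inside.__name__   # updated by the call

print(name_before_call)        # label_inside
print(name_after_call)         # Label set inside
print(label_outside.__name__)  # Label set outside
```

Either placement gives the same label if the name is only read after the test has run; setting it outside the body removes that dependency.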

Compare different distributions using our custom test

np.random.seed(12345678)
n1 = 200
n2 = 300

# Baseline distribution
rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)
df1 = pd.DataFrame({'rvs': 'rvs1', 'val': rvs1})

# Different to rvs1
# stats.ks_2samp(rvs1, rvs2)
# (0.20833333333333334, 5.129279597781977e-05)
rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)
df2 = pd.DataFrame({'rvs': 'rvs2', 'val': rvs2})

# Table 1 for different distributions
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
different = pd.concat([df1, df2], ignore_index=True)
t1_diff = TableOne(data=different, columns=["val"], pval=True, pval_test_name=True,
                   groupby="rvs", stat_test={"val": mytest})
t1_diff

[Screenshot of the resulting table: Screen Shot 2020-04-30 at 12 52 49]

Compare similar distributions using our custom test

# Similar to rvs1
# stats.ks_2samp(rvs1, rvs3)
# (0.10333333333333333, 0.14691437867433876)
rvs3 = stats.norm.rvs(size=n2, loc=0.01, scale=1.0)
df3 = pd.DataFrame({'rvs': 'rvs3', 'val': rvs3})

# Table 1 for similar distributions
similar = pd.concat([df1, df3], ignore_index=True)
t1_similar = TableOne(data=similar, columns=["val"], pval=True, pval_test_name=True,
                      groupby="rvs", stat_test={"val": mytest})
t1_similar

[Screenshot of the resulting table: Screen Shot 2020-04-30 at 12 53 22]

Compare identical distributions using our custom test

# Identical to rvs1
# stats.ks_2samp(rvs1, rvs4)
# (0.07999999999999996, 0.41126949729859719)
rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0)
df4 = pd.DataFrame({'rvs': 'rvs4', 'val': rvs4})

# Table 1 for identical distributions
identical = pd.concat([df1, df4], ignore_index=True)
t1_identical = TableOne(data=identical, columns=["val"], pval=True, pval_test_name=True,
                        groupby="rvs", stat_test={"val": mytest})
t1_identical

[Screenshot of the resulting table: Screen Shot 2020-04-30 at 12 53 52]

alistairewj commented 4 years ago

Probably need to have fixed arguments for mytest right? e.g. two numpy arrays (x1, x2)

tompollard commented 4 years ago

For this particular example or in general? If you know how many arguments your custom function will be receiving (i.e. how many levels exist within the group) you could hardcode them. *args should work as a general solution, although ks_2samp obviously won't be happy if you try giving it anything other than two arrays.
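
To make the point about the number of groups concrete, here is a rough sketch of a custom test that accepts `*args` but fails with a clear message when it does not receive exactly two groups. It assumes, as described in this thread, that the test is called with one sequence of values per group level and returns a p-value. The names `expects_groups` and `perm_test` are hypothetical, and a standard-library permutation test on the difference in means stands in for `ks_2samp` so the snippet runs without SciPy.

```python
import functools
import random
from statistics import mean


def expects_groups(n):
    """Decorator: raise a clear error unless exactly n groups are passed."""
    def decorator(test):
        @functools.wraps(test)  # keeps the wrapped function's __name__
        def wrapper(*groups):
            if len(groups) != n:
                raise ValueError(
                    f"{test.__name__} expects {n} groups, got {len(groups)}"
                )
            return test(*groups)
        return wrapper
    return decorator


@expects_groups(2)
def perm_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    x, y = list(x), list(y)
    observed = abs(mean(x) - mean(y))
    pooled = x + y
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


p_same = perm_test([1, 2, 3], [1, 2, 3])
p_far = perm_test(list(range(10)), list(range(100, 110)))
```

With three or more levels in the grouping variable, the call raises a `ValueError` naming the test instead of failing somewhere inside the statistics code.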

alistairewj commented 4 years ago

I just don't see how you could have total flexibility with *args - I'm assuming TableOne is passing whatever data is associated with that variable to the test, and that it only passes it in a few different ways (grouped/ungrouped). My hunch is the below function wouldn't work:

# define the custom test.
# here the function expects the data plus a precomputed mean and variance for each group
def mytest(data0, mean0, var0, data1, mean1, var1):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    mytest.__name__ = "This is a custom test"
    data0 = (data0 - mean0) / var0
    data1 = (data1 - mean1) / var1
    _, pval = stats.ks_2samp(data0, data1)
    return pval

tompollard commented 4 years ago

yeah, that definitely wouldn't work. i don't see a clean way of supporting custom functions that expect aggregate variables. currently the custom test just takes an array of the values in each level. e.g.

# define the custom test.
# `*` allows the function to take an unknown number of arguments
def mytest(*args):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    mytest.__name__ = "See what I mean?"

    for n, g in enumerate(args):
        print("Group {}: {}\n".format(n, g))

    _, pval = stats.ks_2samp(*args)
    return pval

np.random.seed(1)
n1 = 10
n2 = 10

# Baseline distribution
rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)
df1 = pd.DataFrame({'strata': 'rvs1', 'val': rvs1})

# Different to rvs1
rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)
df2 = pd.DataFrame({'strata': 'rvs2', 'val': rvs2})

# Table 1 for different distributions
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
different = pd.concat([df1, df2], ignore_index=True)
t1_diff = TableOne(data=different, columns=["val"], pval=True, pval_test_name=True,
                   groupby="strata", stat_test={"val": mytest})
t1_diff

Outputs:

Group 0: [ 1.62434536 -0.61175641 -0.52817175 -1.07296862  0.86540763 -2.3015387
  1.74481176 -0.7612069   0.3190391  -0.24937038]

Group 1: [ 2.69316191 -2.59021106  0.01637419 -0.07608153  2.20065416 -1.1498369
  0.24135769 -0.81678763  0.56332062  1.37422282]

tompollard commented 4 years ago

(so in your example, you'd need to do something like this:)

# define the custom test.
# the function takes the raw values for each of the two groups
def mytest(data0, data1):
    """
    Hypothesis test for test_self_defined_statistical_tests
    """
    mytest.__name__ = "This is a custom test"

    # get the aggregate values
    mean0 = np.mean(data0)
    mean1 = np.mean(data1)
    var0 = np.var(data0)
    var1 = np.var(data1)

    data0 = (data0 - mean0) / var0
    data1 = (data1 - mean1) / var1
    _, pval = stats.ks_2samp(data0, data1)
    return pval

tompollard commented 4 years ago

One thing this does make me wonder is whether custom_test is a good name for the argument. As the output is going to end up in a column titled "P-Value", should it be changed to something like pval_custom or similar? @alistairewj @jraffa

alistairewj commented 4 years ago

Maybe something like statistical_test ? I can imagine the input argument being used to specify tests to use for each variable, and accepting either a string/custom-function. So statistical_test = {'var1': 'ttest', 'var2': 'kstest2', 'var3': my_custom_function}. If the variable isn't present in the argument (or if it's None) then use the default choices.
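
A rough sketch of that lookup, to show the three cases (custom function, named test, fall back to the default). This is not how TableOne implements it; `resolve_test`, `named_tests`, `default_test`, and the placeholder tests are all hypothetical names used only for illustration.

```python
def resolve_test(var, statistical_test, named_tests, default_test):
    """Pick the test for `var`: custom callable, named test, or the default."""
    spec = (statistical_test or {}).get(var)
    if spec is None:
        # Variable not listed, or explicitly set to None: use the default.
        return default_test
    if callable(spec):
        # A user-supplied function is used as-is.
        return spec
    if spec in named_tests:
        # A string selects one of the built-in tests by name.
        return named_tests[spec]
    raise ValueError(f"Unknown test {spec!r} for variable {var!r}")


# Placeholder tests; real ones would compute a p-value from the groups.
def fake_ttest(*groups):
    return 0.01


def fake_default(*groups):
    return 0.5


def my_custom_function(*groups):
    return 0.2


named = {"ttest": fake_ttest}
spec = {"var1": "ttest", "var2": None, "var3": my_custom_function}

chosen = {v: resolve_test(v, spec, named, fake_default)
          for v in ["var1", "var2", "var3", "var4"]}
```

Here `var2` (set to `None`) and `var4` (absent) both end up with the default, while an unrecognised string raises a `ValueError` rather than silently falling back.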

tompollard commented 4 years ago

yep statistical_test might be good (or maybe stat_test). @jraffa ?

jraffa commented 4 years ago

I would probably call it a hypothesis test, htest? This would be mimicking R

tompollard commented 4 years ago

okay, thanks. let's go with htest and we can always change later.

tompollard commented 4 years ago

I've added an example to: https://colab.research.google.com/github/tompollard/tableone/blob/master/tableone.ipynb