tompollard / tableone

Create "Table 1" for research papers in Python
https://pypi.python.org/pypi/tableone/
MIT License

Add `drop_na = True` flag or similar #2

Closed alistairewj closed 7 years ago

alistairewj commented 7 years ago

At the moment it seems that any NaN values in a column cause its output summary to be nan (nan).
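For context, here is a minimal sketch of how a `nan (nan)` summary can arise in general. It assumes the summary is computed with plain NumPy reductions, which propagate NaN; whether tableone did this at the time is not confirmed in this thread. The array `values` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
values = np.array([9.5, 10.0, np.nan, 10.5])

# Plain NumPy reductions propagate NaN: one missing value poisons the result.
plain_mean = np.mean(values)

# NaN-aware alternatives skip missing values instead.
nan_aware_mean = np.nanmean(values)
pandas_mean = pd.Series(values).mean()  # pandas uses skipna=True by default

print(plain_mean, nan_aware_mean, pandas_mean)
```

If the summary statistics go through pandas `mean()` and `std()`, missing values are skipped by default, which would match the behaviour a `drop_na`-style flag asks for.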

tompollard commented 7 years ago

I agree that handling of NaNs could be improved (e.g. being explicit about whether or not they are included in the report), but I can't reproduce this particular issue. It may have been fixed since your comment (though the commit that's referenced above actually relates to a different issue).

e.g.:

import pandas as pd
import numpy as np
from tableone import TableOne

n = 10000
data_sample = pd.DataFrame(index=range(n))

mu, sigma = 10, 1
data_sample['normal'] = np.random.normal(mu, sigma, n)

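# try np.nan
data_sample['a_nan'] = np.random.normal(mu, sigma, n)
data_sample.loc[1, 'a_nan'] = np.nan  # single .loc call avoids chained assignment

# try adding list containing None
a_none = list(range(n))  # a Python 3 range does not support item assignment
a_none[1] = None
data_sample['a_none'] = a_none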

TableOne(data_sample, continuous = ['normal', 'a_nan', 'a_none'])

Outputs:

Overall
                     overall
-------------------  -----------------
n                    10000
normal (mean (std))  10.00 (1.00)
a_nan (mean (std))   10.00 (1.01)
a_none (mean (std))  5000.00 (2886.61)
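The summaries above are computed without saying how many values were missing. As a hedged sketch of what "being explicit" about NaNs could look like, the snippet below reports a per-column missing count next to the mean and standard deviation using plain pandas. The column layout and the name `n_missing` are illustrative assumptions, not part of the tableone API.

```python
import numpy as np
import pandas as pd

# Small made-up frame: one column with np.nan, one built from a list with None.
df = pd.DataFrame({
    "a_nan": [10.1, np.nan, 9.8, 10.3],
    "a_none": [0, None, 2, 3],  # None becomes NaN in a float64 column
})

# Report the number of missing values alongside the NaN-skipping statistics.
summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "mean": df.mean(),  # skipna=True by default
    "std": df.std(),    # skipna=True by default
})
print(summary)
```

A reader can then see that each statistic was computed on the non-missing values only.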

tompollard commented 7 years ago

Seems okay with grouping too:

data_sample['cats'] = np.random.choice(range(0, 2), n)
TableOne(data_sample, continuous = ['normal', 'a_nan', 'a_none'], strata_col= 'cats')

Outputs:

Stratified by cats
                     0                  1
-------------------  -----------------  -----------------
n                    4922               5078
normal (mean (std))  10.02 (1.01)       9.98 (0.99)
a_nan (mean (std))   10.01 (1.00)       9.98 (1.01)
a_none (mean (std))  5085.98 (2900.32)  4916.65 (2871.08)

tompollard commented 7 years ago

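Trying a categorical variable, stratified by a column containing NaN:

# float dtype so the column can hold a missing value
data_sample['nyan_cats'] = np.random.choice(range(0, 2), n).astype(float)
data_sample.loc[1, 'nyan_cats'] = None  # stored as NaN; single .loc call avoids chained assignment
TableOne(data_sample, continuous = ['normal', 'a_nan', 'a_none'],
    categorical = ['cats'], strata_col = 'nyan_cats')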

Outputs:

Stratified by nyan_cats
                     0.0                1.0
-------------------  -----------------  -----------------
n                    4959               5040
normal (mean (std))  9.99 (0.99)        10.01 (1.00)
a_nan (mean (std))   10.00 (1.00)       9.99 (1.01)
a_none (mean (std))  5036.44 (2885.80)  4964.15 (2887.24)
cats (n (%))
0                    2524 (50.90)       2398 (47.58)
1                    2435 (49.10)       2642 (52.42)
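Note that the two strata sizes above sum to 9999, not 10000, so the row whose `nyan_cats` value is missing does not appear in either group. The sketch below shows the same effect with a plain pandas `groupby`, which drops rows with a missing group key by default. That tableone groups rows this way internally is an assumption, and the small frame is made up for illustration. The `dropna` argument requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing group key.
df = pd.DataFrame({
    "group": [0.0, 1.0, np.nan, 1.0, 0.0],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Default behaviour: the row with a NaN key is silently excluded.
sizes_default = df.groupby("group").size()

# dropna=False keeps missing keys as their own group.
sizes_with_nan = df.groupby("group", dropna=False).size()

print(sizes_default.sum(), sizes_with_nan.sum())
```

Keeping the missing keys as a separate group is one way to make the excluded rows visible in a stratified table.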

tompollard commented 7 years ago

It seems this was fixed.