tompollard / tableone

Create "Table 1" for research papers in Python
https://pypi.python.org/pypi/tableone/
MIT License

False presentation of categorical variable when combining order argument with already factorized column #106

Closed. JohannesWiesner closed this issue 3 years ago

JohannesWiesner commented 3 years ago

When presenting results from the Chi2-test, TableOne shows the p-value and the name of the test only once, in the first row of the variable (which makes sense, since it avoids duplicate information). If I am not mistaken, the rows appear in alphabetical order, so in the example below 'female' comes before 'male':

+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                |        | g   | i           | t           | h           | u     | b                 |
+================+========+=====+=============+=============+=============+=======+===================+
| n              |        |     | 192         | 93          | 99          |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Age, mean (SD) |        | 0   | 38.6 (12.8) | 38.4 (11.6) | 38.8 (13.9) | 0.793 | Two Sample T-test |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Gender, n (%)  | female | 0   | 48 (25.0)   | 26 (28.0)   | 22 (22.2)   | 0.453 | Chi-squared       |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                | male   |     | 144 (75.0)  | 67 (72.0)   | 77 (77.8)   |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+

Now I have this problem: I only want to present the 'male' row with the corresponding p-value and the name of the test, and delete the 'female' row. For this, of course, the order of the rows would have to be flipped so that the 'male' row comes first and I can delete the 'female' row. At first I thought I could change the ordering of the rows by converting my gender column (dtype: object) into an ordered factor using:

data['sex'] = data['sex'].astype('category')
data['sex'].cat.categories = ['male','female']

But this seems to lead to a false presentation:

+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                |        | g   | i           | t           | h           | u     | b                 |
+================+========+=====+=============+=============+=============+=======+===================+
| n              |        |     | 192         | 93          | 99          |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Age, mean (SD) |        | 0   | 38.6 (12.8) | 38.4 (11.6) | 38.8 (13.9) | 0.793 | Two Sample T-test |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Gender, n (%)  | female | 0   | 144 (75.0)  | 67 (72.0)   | 77 (77.8)   | 0.453 | Chi-squared       |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                | male   |     | 48 (25.0)   | 26 (28.0)   | 22 (22.2)   |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+

Then while writing this issue, I discovered that there is already issue #93, so I used

table = TableOne(data,columns,categorical,groupby,pval=True,htest_name=True,
                    rename={'sex':'Gender','group':'Group','age':'Age'},order={'sex':['male','female']})

to change the order, but I still got the bug. Only after commenting out

# data['sex'] = data['sex'].astype('category')
# data['sex'].cat.categories = ['male','female']

again, I got the right results:

+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                |        | g   | i           | t           | h           | u     | b                 |
+================+========+=====+=============+=============+=============+=======+===================+
| n              |        |     | 192         | 93          | 99          |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Age, mean (SD) |        | 0   | 38.6 (12.8) | 38.4 (11.6) | 38.8 (13.9) | 0.793 | Two Sample T-test |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
| Gender, n (%)  | male   | 0   | 144 (75.0)  | 67 (72.0)   | 77 (77.8)   | 0.453 | Chi-squared       |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+
|                | female |     | 48 (25.0)   | 26 (28.0)   | 22 (22.2)   |       |                   |
+----------------+--------+-----+-------------+-------------+-------------+-------+-------------------+

JohannesWiesner commented 3 years ago

The same holds true also when I use

data['group'] = data['group'].astype('category')
data['group'].cat.categories = ['patient','control']

to 'switch the order of the group factor'. I wanted to use this to force tableone to present the patient column first and then the control column. The bug remains: tableone ignores this refactoring and sticks to one way of creating the table. The numbers are switched but the column titles stay the same, so the table presents false information.

tompollard commented 3 years ago

@JohannesWiesner sorry for the delay in getting to this. I'm not able to reproduce the issue that you have described. Are you able to provide code along with sample data to help demonstrate the problem?

I have run through an example below, which I think achieves what you are trying to do:

1. import tableone and load the data

import pandas as pd
import tableone

# pd.__version__
# '1.1.0'

# tableone.__version__
# '0.7.9'

# load the data and select columns
data = tableone.load_dataset('rhc')
data = data[['age', 'sex', 'swang1']]

print(data.head())

Output:

           age     sex  swang1
0     70.25098    Male  No RHC
1     78.17896  Female     RHC
2     46.09198  Female     RHC
3     75.33197  Female  No RHC
4     67.90997    Male     RHC

2. Create tableone (group by "swang1", default ordering)

tableone.tableone(data, groupby="swang1", pval=True)

Output:

                      Grouped by swang1                                               
                                Missing      Overall       No RHC          RHC P-Value
n                                               5735         3551         2184        
age, mean (SD)                        0  61.4 (16.7)  61.8 (17.3)  60.7 (15.6)   0.022
sex, n (%)     Female                 0  2543 (44.3)  1637 (46.1)   906 (41.5)   0.001
               Male                      3192 (55.7)  1914 (53.9)  1278 (58.5)        

3. Create tableone (group by "swang1" and change the order to show "male" first)

tableone.tableone(data, groupby="swang1", pval=True, order={"sex": ["Male", "Female"]})

Output:

                      Grouped by swang1                                               
                                Missing      Overall       No RHC          RHC P-Value
n                                               5735         3551         2184        
age, mean (SD)                        0  61.4 (16.7)  61.8 (17.3)  60.7 (15.6)   0.022
sex, n (%)     Male                   0  3192 (55.7)  1914 (53.9)  1278 (58.5)   0.001
               Female                    2543 (44.3)  1637 (46.1)   906 (41.5)        

4. Create tableone (group by "swang1" and only show "male")

tableone.tableone(data, groupby="swang1", pval=True, order={"sex": ["Male", "Female"]}, limit={"sex": 1})

Output:

                    Grouped by swang1                                               
                              Missing      Overall       No RHC          RHC P-Value
n                                             5735         3551         2184        
age, mean (SD)                      0  61.4 (16.7)  61.8 (17.3)  60.7 (15.6)   0.022
sex, n (%)     Male                 0  3192 (55.7)  1914 (53.9)  1278 (58.5)   0.001

tompollard commented 3 years ago

Sorry, I see what you are saying now. The issue is with the way categorical variables are handled. Definitely a bug...I'll try to fix this now. [UPDATE: the issue appears to be that the input dataframe is being modified by this step: data['sex'].cat.categories = ['male','female']]

JohannesWiesner commented 3 years ago

Probably it would make most sense to 'ignore' the alphabetical order for categorical dtype by default and present categorical variables based on the order that is set in my_df['my_categorical_variable'].cat.categories. The keyword argument order should only manipulate the presentation of the factor levels, not the variables themselves. If the variable has the object-dtype, alphabetical order of course would make sense.
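
A minimal sketch of the ordering rule proposed here, using pandas only. `level_order` is a hypothetical helper written for illustration and is not part of tableone: an explicit `order` wins, a categorical column uses the order stored in `.cat.categories`, and an object column falls back to alphabetical order.

```python
import pandas as pd


def level_order(series, order=None):
    """Hypothetical helper: choose the display order of a variable's levels."""
    if order is not None:
        # An explicit order only affects presentation, never the data.
        return list(order)
    if isinstance(series.dtype, pd.CategoricalDtype):
        # Respect the order already stored on the categorical column.
        return list(series.cat.categories)
    # Plain object/string column: fall back to alphabetical order.
    return sorted(series.dropna().unique())


obj = pd.Series(["male", "female", "male"])
cat = obj.astype("category").cat.reorder_categories(["male", "female"])

print(level_order(obj))                             # ['female', 'male']
print(level_order(cat))                             # ['male', 'female']
print(level_order(obj, order=["male", "female"]))   # ['male', 'female']
```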

tompollard commented 3 years ago

Sorry, just getting around to this now. The core issue seems to be the way in which categorical order is set in your example, which essentially swaps the "Male" and "Female" categories in the dataset itself. It's confusing behaviour and I'm not sure that Pandas should allow it to happen.

1. Import the data and view the first few lines

import tableone
data = tableone.load_dataset('rhc')
data = data[['age', 'sex', 'swang1']]
data.head()

        age     sex  swang1
0  70.25098    Male  No RHC
1  78.17896  Female     RHC
2  46.09198  Female     RHC
3  75.33197  Female  No RHC
4  67.90997    Male     RHC

2. Use the suggested approach to set the "sex" column as categorical and assign categories

Here we are using the method described in the original comment:

data['sex'] = data['sex'].astype('category')
data['sex'].cat.categories = ['Male', 'Female']

3. Review the data

When viewing the first few rows of the data, note that the categories have been reversed (the first patient has flipped from "Male" to "Female"):

data.head()

        age     sex  swang1
0  70.25098  Female  No RHC
1  78.17896    Male     RHC
2  46.09198    Male     RHC
3  75.33197    Male  No RHC
4  67.90997  Female     RHC
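
The flip can be reproduced on a toy Series. The sketch below assumes only pandas and uses `rename_categories`, which relabels categories positionally in the same way as the assignment above (as far as I know, newer pandas releases no longer accept the direct assignment to `.cat.categories`). The five values mirror the first five rows of the "sex" column.

```python
import pandas as pd

sex = pd.Series(["Male", "Female", "Female", "Female", "Male"]).astype("category")

# Categories are inferred in sorted order, and each value is stored as an
# integer code pointing into that list.
print(list(sex.cat.categories))  # ['Female', 'Male']
print(list(sex.cat.codes))       # [1, 0, 0, 0, 1]

# Renaming swaps the labels but leaves the codes untouched, so every row
# now carries the opposite label.
renamed = sex.cat.rename_categories(["Male", "Female"])
print(list(renamed.cat.codes))   # [1, 0, 0, 0, 1]
print(list(renamed))             # ['Female', 'Male', 'Male', 'Male', 'Female']
```

In other words, the table is built from data whose labels have already been swapped, which is why the counts appear under the wrong names.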

tompollard commented 3 years ago

It looks like the reorder_categories method can be used for setting the order of categories. As far as I can see, tableone deals with the categories as expected but I will do some more checking (and let me know if you spot anything that doesn't look right!).

1. Import the data and view the first few lines

import tableone
data = tableone.load_dataset('rhc')
data = data[['age', 'sex', 'swang1']]
data.head()

        age     sex  swang1
0  70.25098    Male  No RHC
1  78.17896  Female     RHC
2  46.09198  Female     RHC
3  75.33197  Female  No RHC
4  67.90997    Male     RHC

2. View using tableone

t1 = tableone.tableone(data, groupby="swang1", pval=True)
print(t1.tabulate(tablefmt = "github"))
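
|                |        | Missing | Overall     | No RHC      | RHC         | P-Value |
|----------------|--------|---------|-------------|-------------|-------------|---------|
| n              |        |         | 5735        | 3551        | 2184        |         |
| age, mean (SD) |        | 0       | 61.4 (16.7) | 61.8 (17.3) | 60.7 (15.6) | 0.022   |
| sex, n (%)     | Female | 0       | 2543 (44.3) | 1637 (46.1) | 906 (41.5)  | 0.001   |
|                | Male   |         | 3192 (55.7) | 1914 (53.9) | 1278 (58.5) |         |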

3. Set "sex" as a categorical variable and order using reorder_categories

data['sex'] = data['sex'].astype('category')
data['sex'] = data['sex'].cat.reorder_categories(['Male', 'Female'], ordered=True)

4. View the first few lines of the data again

As expected, the order of the sex column is unchanged:

data.head()

        age     sex  swang1
0  70.25098    Male  No RHC
1  78.17896  Female     RHC
2  46.09198  Female     RHC
3  75.33197  Female  No RHC
4  67.90997    Male     RHC

5. View again using tableone

t2 = tableone.tableone(data, groupby="swang1", pval=True)
print(t2.tabulate(tablefmt = "github"))
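
|                |        | Missing | Overall     | No RHC      | RHC         | P-Value |
|----------------|--------|---------|-------------|-------------|-------------|---------|
| n              |        |         | 5735        | 3551        | 2184        |         |
| age, mean (SD) |        | 0       | 61.4 (16.7) | 61.8 (17.3) | 60.7 (15.6) | 0.022   |
| sex, n (%)     | Male   | 0       | 3192 (55.7) | 1914 (53.9) | 1278 (58.5) | 0.001   |
|                | Female |         | 2543 (44.3) | 1637 (46.1) | 906 (41.5)  |         |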

6. Display the first category (Male) only

t3 = tableone.tableone(data, groupby="swang1", pval=True, limit={"sex": 1})
print(t3.tabulate(tablefmt = "github"))
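
|                |      | Missing | Overall     | No RHC      | RHC         | P-Value |
|----------------|------|---------|-------------|-------------|-------------|---------|
| n              |      |         | 5735        | 3551        | 2184        |         |
| age, mean (SD) |      | 0       | 61.4 (16.7) | 61.8 (17.3) | 60.7 (15.6) | 0.022   |
| sex, n (%)     | Male | 0       | 3192 (55.7) | 1914 (53.9) | 1278 (58.5) | 0.001   |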
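
For comparison, a small pandas-only sketch of what `reorder_categories` does on a toy Series (the five values mirror the first five rows of the "sex" column): the order of the levels changes, but every row keeps its original value.

```python
import pandas as pd

sex = pd.Series(["Male", "Female", "Female", "Female", "Male"]).astype("category")

# reorder_categories changes only the order of the levels; the set of labels
# must stay the same, and every row keeps its original value.
reordered = sex.cat.reorder_categories(["Male", "Female"], ordered=True)

print(list(reordered.cat.categories))  # ['Male', 'Female']
print(list(reordered) == list(sex))    # True
# Sorting follows the category order, so 'Male' now comes first.
print(list(reordered.sort_values()))   # ['Male', 'Male', 'Female', 'Female', 'Female']
```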

tompollard commented 3 years ago

@JohannesWiesner based on the comments above, are we good to close this issue? (it seems that the input data itself was being modified).

JohannesWiesner commented 3 years ago

@tompollard, you're completely right, I just used the wrong method!

Assigning to pandas.Series.cat.categories replaces the category labels with the input list, so the values themselves get relabelled.

pandas.Series.cat.reorder_categories does what I actually wanted to do.

So I introduced the problem by myself.

I guess that one's on me, sorry for that!

tompollard commented 3 years ago

@JohannesWiesner thanks and no problem - it was a useful learning experience!