tompollard / tableone

Create "Table 1" for research papers in Python
MIT License

ValueError raised when calling TableOne(data) #62

Closed Yuyoo closed 6 years ago

Yuyoo commented 6 years ago

Hi Tom and Alistair. Long time no see since the Datathon in Beijing in 2017. How have you been? I found a bug in the latest version, 0.5.6. Because conditions are evaluated differently in py2 and py3, there is a bug at line 96 that raises an error when calling TableOne(data).

At line 96, "data[columns].columns.get_duplicates()" returns "Index([], dtype='object')". In py3, Index([], dtype='object') cannot be evaluated as False, and it throws: ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I have tested it in py2, and it works well there.

I suggest fixing it by changing "data[columns].columns.get_duplicates()" to "data[columns].columns.get_duplicates().values.size", or you can solve it another way.
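For anyone following along, here is a minimal sketch of the behaviour described above. It assumes only that pandas is installed; the variable names are illustrative and are not taken from the tableone source. It shows that using an empty Index directly in an `if` raises the ValueError, while the `.empty` attribute gives an unambiguous answer.

```python
import pandas as pd

# An empty Index, like the one reported for data with no duplicate columns
empty_index = pd.Index([], dtype="object")

# Using the Index directly as a condition raises ValueError
try:
    if empty_index:
        pass
    raised_value_error = False
except ValueError:
    raised_value_error = True

# The .empty attribute is an explicit, unambiguous emptiness check
index_is_empty = empty_index.empty
```
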

tompollard commented 6 years ago

hi @Yuyoo, thanks for highlighting this issue. Please could you provide code to reproduce the problem? In Python 3, the following code returns the expected "duplicate columns" error for me:

# load sample data into a pandas dataframe

# create duplicate columns
data = data.rename(index=str, columns={"MechVent": "Height", "Weight": "Height", 
                                       "SysABP":"Age", "ICU":"Age"})

# create table
overall_table = TableOne(data)

raises the expected error:

InputError                                Traceback (most recent call last)
<ipython-input-8-332ab7cb68f8> in <module>()
      1 # create an instance of TableOne with the input arguments
      2 # firstly, with no grouping variable
----> 3 overall_table = TableOne(data)

~/projects/tableone/ in __init__(self, data, columns, categorical, groupby, nonnormal, pval, pval_adjust, isnull, ddof, labels, sort, limit, remarks)
     96         dups = data[columns].columns.get_duplicates()
     97         if dups:
---> 98             raise InputError('Input contains duplicate columns: {}'.format(dups))
    100         # if categorical not specified, try to identify categorical

InputError: Input contains duplicate columns: ['Age', 'Height']

Your suggested fix returns an error:

columns = data.columns.get_values()


AttributeError                            Traceback (most recent call last)
<ipython-input-18-adf1753aef69> in <module>()
----> 1 data[columns].columns.get_duplicates().values.size

AttributeError: 'list' object has no attribute 'values'

Yuyoo commented 6 years ago

The bug happened as follows:

TableOne(data)
C:\Users\Yuyoo\Anaconda3\lib\site-packages\ FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead
  dups = data[columns].columns.get_duplicates()
Traceback (most recent call last):
  File "D:/Tianchi/meinian2/code/", line 8, in <module>
    print(TableOne(data))
  File "C:\Users\Yuyoo\Anaconda3\lib\site-packages\", line 97, in __init__
    if dups:
  File "C:\Users\Yuyoo\Anaconda3\lib\site-packages\pandas\core\indexes\", line 2002, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Sorry, I didn't test my method in py2. In py2, data[columns].columns.get_duplicates() returns a list, and a 'list' object has no attribute 'values'. I think you can change it to "len(data[columns].columns.get_duplicates())", which works in both py2 and py3.
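A rough sketch of the len()-based check suggested above, assuming only that pandas is installed. The helper name `has_duplicates` is hypothetical and does not exist in tableone. The point is that `len()` works the same way whether the duplicates come back as a plain list or as a pandas Index.

```python
import pandas as pd


def has_duplicates(dups):
    """Return True if `dups` (a list or a pandas Index) holds any items.

    Hypothetical helper for illustration only; not part of tableone.
    """
    return len(dups) > 0


# A plain list, as returned by the older behaviour described above
list_empty = has_duplicates([])
list_with_dups = has_duplicates(["Age", "Height"])

# A pandas Index, as returned by the newer behaviour
index_empty = has_duplicates(pd.Index([], dtype="object"))
index_with_dups = has_duplicates(pd.Index(["Age", "Height"]))
```
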

tompollard commented 6 years ago

Okay, got it, thanks @Yuyoo. I now get this error after upgrading to pandas '0.23.0' (from '0.22.0').

Yuyoo commented 6 years ago

Yeah, pip install --upgrade tableone upgrades pandas by default. I didn't get the error when I used the old version of tableone.

tompollard commented 6 years ago

Yeah, bad timing because we just published a paper about the package! We'll get the issues fixed as soon as possible. This particular bug is fixed with:

        # check for duplicate columns
        dups = data[columns].columns[data[columns].columns.duplicated()].unique()
        if not dups.empty:
            raise InputError('Input contains duplicate columns: {}'.format(dups))

We'll work on the other issues shortly. Thanks again for raising this :)
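As a standalone illustration of the duplicate-column check quoted above, here is a small sketch that runs outside of TableOne. The function name `find_duplicate_columns` and the toy data are assumptions made for this example. It also checks `data.columns` directly rather than `data[columns].columns`, which is a simplification of the quoted fix.

```python
import pandas as pd


def find_duplicate_columns(data):
    """Return the duplicated column labels of `data` as a list.

    Hypothetical helper for illustration; it mirrors the
    columns[columns.duplicated()].unique() pattern from the fix.
    """
    dups = data.columns[data.columns.duplicated()].unique()
    return list(dups)


# Toy frame with "Age" appearing twice
with_dups = pd.DataFrame([[1, 2, 3, 4]],
                         columns=["Age", "Height", "Age", "Weight"])
# Toy frame with unique column labels
without_dups = pd.DataFrame([[1, 2, 3]],
                            columns=["Age", "Height", "Weight"])

dup_labels = find_duplicate_columns(with_dups)
no_dup_labels = find_duplicate_columns(without_dups)
```

Because the helper returns a plain list, a caller can test it with an ordinary `if dup_labels:` without hitting the ambiguous truth value error.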

Yuyoo commented 6 years ago

Haha, it's no problem, everything will be OK. You have done a good job, and the package makes our research more convenient. Best wishes to you!

tompollard commented 6 years ago

The following line also raises an error in pandas 0.23.0:

grouped_data = pd.crosstab(data[self._groupby],data[v])

ValueError: Duplicated level name: "death", 
assigned to level 1, is already used for level 0.

The error is raised when the _groupby column matches v (in the case above, groupby='death' and v='death')
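One possible workaround, sketched here as an assumption rather than as the change that was actually made in tableone: give the second Series a different name before calling pd.crosstab, so the row and column levels never share a name. The toy data and the "_value" suffix are hypothetical.

```python
import pandas as pd

# Toy data where the grouping column and the variable are the same column
data = pd.DataFrame({"death": [0, 1, 1, 0, 1]})
groupby = "death"
v = "death"

rows = data[groupby]
cols = data[v]

# Rename the column Series when both names collide
# (the "_value" suffix is an arbitrary choice for this sketch)
if groupby == v:
    cols = cols.rename(v + "_value")

grouped_data = pd.crosstab(rows, cols)
```
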

Odd, because it looks like this was fixed as a bug in Pandas at some point in the past:

tompollard commented 6 years ago

Fixed in version 0.5.7. Thanks again :)