Closed alistairewj closed 6 years ago
I'd like to merge most of this but want to avoid the cherry-pick again because of the problems it caused last time! One of the commits - I think 5dff45a - seems to throw out the numbers in the null column and raises a warning when the grouped table is created:
# create an instance of TableOne with the input arguments
grouped_table = TableOne(data, columns, categorical, groupby, nonnormal)
/usr/local/lib/python3.6/site-packages/pandas/core/indexes/api.py:87:
RuntimeWarning: '<' not supported between instances of 'str' and 'float',
sort order is undefined for incomparable objects
result = result.union(other)
Instead of coercing string-containing fields to numbers, maybe we should improve data type checks on the input? e.g. if a column contains non-numerical values but it is not specified as categorical, we could fail with:
<column> does not appear to be numerical. Either specify as a categorical
variable or remove the non-numerical values
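A rough sketch of what such a check might look like. The helper names `is_missing` and `check_numeric` are hypothetical and not part of TableOne. It assumes `data` is any mapping from column name to an iterable of values, and it treats strings that parse cleanly as numbers (e.g. "900") as acceptable:

```python
import math


def is_missing(value):
    """True for None or a float NaN, i.e. a genuinely empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_numeric(data, columns, categorical):
    """Raise ValueError if a non-categorical column holds non-numerical values.

    Hypothetical helper: `data` maps column name -> iterable of values.
    """
    for col in columns:
        if col in categorical:
            continue  # categorical columns may hold arbitrary labels
        for value in data[col]:
            if is_missing(value):
                continue  # missing values are not treated as errors here
            try:
                float(value)  # "1,500", "1,5" and "1/2" all fail to parse
            except (TypeError, ValueError):
                raise ValueError(
                    "{} does not appear to be numerical. Either specify as a "
                    "categorical variable or remove the non-numerical "
                    "values".format(col)
                )


# Usage: "cost" contains "1,500", so the check fails instead of guessing.
data = {"cost": ["900", "500", "1,500", "100"], "sex": ["m", "f", "m", "f"]}
try:
    check_numeric(data, columns=["cost", "sex"], categorical=["sex"])
    message = None
except ValueError as err:
    message = str(err)
```

Failing on the first bad value keeps the sketch short. A real implementation might instead collect every offending column before raising.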
I think I prefer an explicit fail because it is possible that coercion may cause numbers to be misreported in certain cases, e.g. when commas are used as a separator for thousands:
"900"
"500"
"1,500"
"100"
The mean is reported as 500 instead of 750. Similar issues might come up with commas as a decimal separator (e.g. "1,5" for 1.5), fractions (e.g. 1/2 for 0.5), etc.
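To make the arithmetic concrete, here is a small stand-alone sketch that mimics the effect of coercion without using pandas. The helpers `coerce` and `mean_skipping_nan` are hypothetical; the assumption is that coercion turns unparseable strings into NaN and that the mean then skips NaN values:

```python
import math


def coerce(value):
    """Mimic coercion: unparseable strings silently become NaN."""
    try:
        return float(value)
    except ValueError:
        return math.nan


def mean_skipping_nan(values):
    """Average only the non-NaN values."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)


raw = ["900", "500", "1,500", "100"]

# "1,500" does not parse, so only 900, 500 and 100 are averaged.
coerced_mean = mean_skipping_nan([coerce(v) for v in raw])

# With the thousands separator stripped, all four values contribute.
intended_mean = mean_skipping_nan([float(v.replace(",", "")) for v in raw])

print(coerced_mean, intended_mean)
```

The same silent drop happens for "1,5" and "1/2", which is why an explicit failure seems the safer default.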
This now fails explicitly, and also fixes a bug in counting null values!
Changes