rhiever / datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.
MIT License

Planned functionality #1

Open rhiever opened 8 years ago

rhiever commented 8 years ago

In the immediate future, datacleaner will:

See this tweet chain for more ideas.

If anyone has more ideas, please add them here.

jaumebp commented 8 years ago

In my experience it is worth identifying ordinal variables (e.g. numerical grades) and handling them separately. In many cases these can be treated as continuous variables, but sometimes it is necessary to treat them as discrete ones. One example of this is missing value imputation. If you treat them as continuous, you may end up injecting fake values that can then mislead the downstream analysis.
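A minimal sketch of the imputation problem described above, assuming pandas is available. The grade values are made up for illustration; the point is only that a median fill can produce a value that is not a valid grade, while a mode fill cannot.

```python
import pandas as pd

# Hypothetical ordinal column: integer grades from 1 to 5, with one missing value.
grades = pd.Series([1, 2, 2, 5, 5, 5, None])

# Treating the column as continuous: fill the gap with the median.
median_filled = grades.fillna(grades.median())

# Treating the column as discrete: fill the gap with the most frequent grade.
mode_filled = grades.fillna(grades.mode().iloc[0])

print(median_filled.iloc[-1])  # falls between two grade levels
print(mode_filled.iloc[-1])    # an observed grade level
```

With an even number of observed values the median is the midpoint of the two middle values, so the filled entry is a grade nobody could actually receive.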

Thanks for the project! I tested it on some of my biomedical datasets and compared the PCA before/after the cleaning. The only case where there were differences is a dataset with discrete variables (Exome sequencing) and specifically in the columns where some of the values were '0'. There was the following error message: sys:1: DtypeWarning: Columns (6,19,131,225,404,416,515,651,833,945,975,986,1265,1327,1387,1494,1541,1558,1715,1737,1854,1875,1947,1980,2015,2024,2111,2132,2140,2165,2426,2652,2667,2668,2871,2943,2978,2997,3165,3335,3634,3807,3945,4010,4018,4177,4191,4196,4243,4245,4389,4463,4553,4772,4814,4841,4962) have mixed types. Specify dtype option on import or set low_memory=False.
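The warning text itself names two remedies: pass `dtype` on import, or set `low_memory=False`. A small sketch of the first option, assuming the file is loaded with `pandas.read_csv` before cleaning; the column name and CSV contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV: a genotype-like column mixing numbers and strings.
csv_text = "sample,variant\ns1,0\ns2,1\ns3,A/T\n"

# Reading every value of the column as a string gives it one consistent type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"variant": str})

print(df["variant"].tolist())
```

The second option would be `pd.read_csv(path, low_memory=False)`, which lets pandas infer each column's type from the whole file instead of chunk by chunk.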

rhiever commented 8 years ago

Indeed, which is why I'm trying to discover how to identify ordinal vs. continuous variables. I posted this question on StackOverflow to brainstorm.

jaumebp commented 8 years ago

In our software we went with a much simpler approach. Letting the user specify a list of attributes to be treated as ordinal. Of course, an automatic solution is far more elegant :)
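The user-specified approach could look roughly like the following. This is a sketch only: `impute_missing` and `ordinal_columns` are hypothetical names, not part of datacleaner or of the software mentioned above, and the data is made up.

```python
import pandas as pd

def impute_missing(df, ordinal_columns=()):
    """Fill NaNs: mode for user-listed ordinal columns, median for other numeric ones.

    Hypothetical helper for illustration; not a datacleaner function.
    """
    result = df.copy()
    for column in result.columns:
        if not result[column].isna().any():
            continue
        if column in ordinal_columns:
            fill_value = result[column].mode().iloc[0]
        elif pd.api.types.is_numeric_dtype(result[column]):
            fill_value = result[column].median()
        else:
            continue  # non-numeric columns are left untouched in this sketch
        result[column] = result[column].fillna(fill_value)
    return result

df = pd.DataFrame({
    "grade": [1, 2, 2, 5, 5, 5, None],
    "weight": [60.0, 72.5, None, 80.0, 65.0, 70.0, 90.0],
})
cleaned = impute_missing(df, ordinal_columns=["grade"])
print(cleaned)
```

The list keeps the decision with the person who knows the data, at the cost of one extra argument per call.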

westurner commented 7 years ago

"Convenience function: Detect if there are non-numerical features and encode them as numerical features" https://github.com/rhiever/tpot/issues/61

westurner commented 7 years ago

Do I have to do get_dummies() all by myself? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

... get_dummies() accepts a number of kwargs

westurner commented 7 years ago

Do I have to do get_dummies() all by myself?

I think it illogical to e.g. average Exterior1st in the Kaggle House Prices Dataset: the average of ImStucc and Wd Sdng seems nonsensical?
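For a nominal column such as `Exterior1st`, one of the `get_dummies()` keyword arguments is `columns`, which restricts the expansion to the listed columns. A small sketch with a made-up stand-in for the House Prices data:

```python
import pandas as pd

# Tiny stand-in for the House Prices data; the rows are made up for illustration.
houses = pd.DataFrame({
    "Exterior1st": ["ImStucc", "Wd Sdng", "VinylSd", "Wd Sdng"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Expand only the listed column into indicator columns; LotArea is passed through.
encoded = pd.get_dummies(houses, columns=["Exterior1st"])

print(list(encoded.columns))
```

Each category becomes its own indicator column, so no average across categories is ever computed.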

westurner commented 7 years ago

CSVW as JSONLD may be a good way to specify a dataset header with the relevant metadata for such operations? https://github.com/pandas-dev/pandas/issues/3402

rhiever commented 7 years ago

You should be able to use the sklearn OneHotEncoder to get the equivalent of the pandas get_dummies().

westurner commented 7 years ago

You should be able to use the sklearn OneHotEncoder to get the equivalent of the pandas get_dummies().

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Is there a way to specify that I only need certain columns to be expanded into multiple columns w/ OneHotEncoder?

rhiever commented 7 years ago

See the docs you linked and the categorical_features parameter.
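Note for readers on newer scikit-learn releases: the `categorical_features` parameter of `OneHotEncoder` has since been removed, and column selection is usually done with `ColumnTransformer` instead. A sketch under that assumption, with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only the column names mirror the House Prices data.
houses = pd.DataFrame({
    "Exterior1st": ["ImStucc", "Wd Sdng", "VinylSd", "Wd Sdng"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One-hot encode only Exterior1st; all other columns pass through unchanged.
transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Exterior1st"])],
    remainder="passthrough",
)
encoded = transformer.fit_transform(houses)

print(encoded.shape)
```

The same `transformers` list accepts further `(name, transformer, columns)` entries, which is one way to apply different preprocessing steps to different columns.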

westurner commented 7 years ago

Do I need to write a FunctionTransformer to stack multiple preprocessing modules?

westurner commented 7 years ago

Do I need to write a FunctionTransformer to stack multiple preprocessing modules?

i.e., for different columns. Or just run autoclean multiple times?

rhiever commented 7 years ago

Running autoclean multiple times might be the easier solution. Might be a useful extension to autocleaner to allow the user to pass multiple preprocessors in a list.

westurner commented 7 years ago

Might be a useful extension to autocleaner to allow the user to pass multiple preprocessors in a list.

https://github.com/paulgb/sklearn-pandas DataFrameMapper supports various combinations of columns and transformations.

westurner commented 7 years ago

It may be worth noting that pandas Categoricals have an ordered=True parameter. http://pandas.pydata.org/pandas-docs/stable/categorical.html#sorting-and-order

Does specifying the Categoricals have a different effect than inferring the ordinals from the happenstance sequence of strings in a given dataset?
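One way to see the difference is to compare the integer codes pandas assigns in each case. A small sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical ordinal column with string labels.
sizes = pd.Series(["medium", "low", "high", "low"])

# Explicit, user-declared order.
declared = pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)

# No declared categories: pandas sorts the observed labels lexically.
inferred = pd.Categorical(sizes)

print(list(declared.codes))
print(list(inferred.codes))
```

In the declared case the codes follow low < medium < high. In the inferred case they follow alphabetical order (high, low, medium), so any encoding derived from them would not reflect the intended ranking.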

adrose commented 7 years ago

Any plans to impute NAs rather than replace missing values in continuous variables with the median value?

rhiever commented 7 years ago

@adrose, do you mean via model-based imputation?

adrose commented 7 years ago

@rhiever sorry should have been A LOT more specific, but yes something similar to what the Amelia command is doing in this R package - i.e. (bootstrapped linear regression).
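As a rough illustration of the idea of bootstrapped linear regression imputation, here is a simplified sketch using NumPy. It does not reproduce what Amelia does; `bootstrap_regression_impute` is a hypothetical name, the data is synthetic, and the sketch handles a single predictor and omits the residual noise a full multiple-imputation procedure would add.

```python
import numpy as np

def bootstrap_regression_impute(x, y, n_draws=5, seed=0):
    """Return n_draws completed copies of y, each from a line fit on a bootstrap resample.

    Simplified illustration only; not the algorithm Amelia implements.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    x_obs, y_obs = x[observed], y[observed]
    completed = []
    for _ in range(n_draws):
        # Resample the observed rows with replacement and refit the regression.
        idx = rng.integers(0, len(x_obs), size=len(x_obs))
        slope, intercept = np.polyfit(x_obs[idx], y_obs[idx], deg=1)
        filled = y.copy()
        # Predict only the missing entries; observed values are kept as they are.
        filled[~observed] = slope * x[~observed] + intercept
        completed.append(filled)
    return completed

# Synthetic data: a noisy line with two missing responses.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(42).normal(scale=0.5, size=10)
y[[3, 7]] = np.nan

imputations = bootstrap_regression_impute(x, y)
print([np.round(draw[[3, 7]], 2) for draw in imputations])
```

Each draw gives slightly different filled values, and the spread across draws gives a sense of the uncertainty that a single median fill hides.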

Happy to expand on it more, or would be excited to see if you have any thoughts on this function if you think it may be applicable.

westurner commented 7 years ago