tensorflow / skflow

Simplified interface for TensorFlow (mimicking Scikit Learn) for Deep Learning
Apache License 2.0
3.18k stars, 441 forks

faster data cleaning #158

Closed kengz closed 8 years ago

kengz commented 8 years ago

Could we have multi-column versions of two functions commonly used for data cleaning, fillna() and LabelEncoder(), that work directly on the entire data frame X rather than column by column?

MultiFillna(X, str_val='NA', num_val=0) would perform a column-wise fillna() on X using the given or default values: 'NA' for string columns and 0 for numerical columns. This is especially useful when X has a mix of string and numeric columns and we want to fill all missing values in one go.
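A minimal sketch of what such a helper might look like, assuming X is a pandas DataFrame. The name multi_fillna and the sample columns are hypothetical; this is not an existing skflow or pandas function, only an illustration built on DataFrame.fillna with a per-column dict.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def multi_fillna(X, str_val="NA", num_val=0):
    """Hypothetical helper: fill missing values in every column at once.

    Numeric columns receive num_val; all other columns receive str_val.
    Returns a new DataFrame and leaves X untouched.
    """
    # Build a {column: fill value} mapping based on each column's dtype.
    fill_values = {
        col: num_val if is_numeric_dtype(X[col]) else str_val
        for col in X.columns
    }
    # DataFrame.fillna accepts a dict and applies each value to its column.
    return X.fillna(value=fill_values)


# Illustrative data with one missing string and one missing number.
X = pd.DataFrame({
    "Sex": ["male", None, "female"],
    "Age": [22.0, None, 38.0],
})
X_clean = multi_fillna(X)
```

Choosing the fill value from the column dtype keeps the call to a single line for mixed frames, at the cost of treating every non-numeric column (dates included) as a string column.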

MultiLabelEncoder would apply fit_transform to each column named by its header, and its reverse_transform would apply the inverse mapping. The fitted encoder could be saved with the model by classifier.save(path) and restored for direct use by classifier.restore(path).

For example, with the Titanic data, one could load the model together with its MultiLabelEncoder, take the input x=['male', 22, 1, 7.25], and call predict(x), which would internally use the encoder to transform x.
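The proposed encoder and the single-row use case could be sketched roughly as follows. This assumes pandas and scikit-learn's LabelEncoder; the MultiLabelEncoder class, its reverse_transform method, and the column names Sex, Age, SibSp, and Fare are illustrative guesses, not an existing API. Integration with classifier.save(path) and predict(x) is not shown.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class MultiLabelEncoder:
    """Hypothetical wrapper: one scikit-learn LabelEncoder per named column."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.encoders = {}  # column name -> fitted LabelEncoder

    def fit_transform(self, X):
        X = X.copy()
        for col in self.columns:
            encoder = LabelEncoder()
            X[col] = encoder.fit_transform(X[col])
            self.encoders[col] = encoder
        return X

    def transform(self, X):
        # Reuse the already fitted encoders, e.g. on a new row at predict time.
        X = X.copy()
        for col in self.columns:
            X[col] = self.encoders[col].transform(X[col])
        return X

    def reverse_transform(self, X):
        # Map integer codes back to the original labels.
        X = X.copy()
        for col in self.columns:
            X[col] = self.encoders[col].inverse_transform(X[col])
        return X


# Illustrative training frame; column names are assumptions.
columns = ["Sex", "Age", "SibSp", "Fare"]
train = pd.DataFrame(
    [["male", 22, 1, 7.25], ["female", 38, 1, 71.28], ["female", 26, 0, 7.93]],
    columns=columns,
)
encoder = MultiLabelEncoder(["Sex"])
train_encoded = encoder.fit_transform(train)

# Encode a single input row the way predict(x) might do internally.
x = pd.DataFrame([["male", 22, 1, 7.25]], columns=columns)
x_encoded = encoder.transform(x)
x_restored = encoder.reverse_transform(x_encoded)
```

Because the fitted state is only a dict of LabelEncoder objects, it could in principle be pickled next to the model, though how that would hook into classifier.save(path) is left open here.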

ilblackdragon commented 8 years ago

This is actually possible now with FeatureColumns (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/feature_column.py#L34). Specifically, see sparse_column_with_keys.

Let us know how it works (you can use the issue tracker at https://github.com/tensorflow/tensorflow/).