usc-isi-i2 / dsbox-cleaning

The data cleaning TA1 component of DSBox
MIT License
6 stars 4 forks source link

travis ci

ISI DSBox Cleaning Primitives

The git repository containing DSBox cleaning related primitives is here. The git repository for DSBox primitives related to featurization is located here.

Data cleaning primitives

d3m.primitives.dsbox.CleaningFeaturizer

This is a multi-purpose cleaning featurizer primitive. This primitive requires metadata annotations from ISI's profiling primitive, see d3m.primitives.dsbox.Profiler below. The cleaning featurization operations supported include:

d3m.primitives.dsbox.FoldColumns

Fold multiple columns into one column based on common column name prefix. For example, fold columns with names 'month-jan', 'month-feb', 'month-mar' and so on, into one column named 'month'.

Encoding primitives

d3m.primitives.dsbox.Encoder

Performs one-hot encoding for categorical attributes. This encoder can handle missing values, and it allows user to specify the upper limit of columns to generate per cagtegorical attribute, n_limit.

d3m.primitives.dsbox.UnaryEncoder

Performs unary encoding, which useful for ordinal data.

Imputation primitives

d3m.primitives.dsbox.MeanImputation

Performs mean missing value imputation for numerical columns, and mode imputation for categorical columns.

d3m.primitives.dsbox.GreedyImputation

Performs missing value imputation by greedy search over simple imputation methods, i.e. mean, min, max, and zero.

d3m.primitives.dsbox.IterativeRegressionImputation

Performs missing value imputation by regression, then improve the imputation by iterating over columns with missing values.

Profiling Primitive

d3m.primitives.dsbox.Profiler

This primitive generates metadata by examining the given data. The types of metadata include:

Datamart Primitives

d3m.primitives.dsbox.QueryDataframe

Queries datamart for available datasets. The JSON query specification is defined Datamart Query API. The primitive returns a list of dataset metadata.

d3m.primitives.dsbox.Join

Joins two dataframes into one dataframe. The primtive takes two dataframes, left_dataframe and right_dataframe, and two lists specifing the join columns, left_columns and right_columns.