usc-isi-i2 / dsbox-profiling

The data profiling TA1 component of DSBox
MIT License

Introduction

A TA1 primitive for the D3M project. It currently generates data profiles for tabular data. We use the pandas DataFrame as our main data type.

Requirements

See setup.py.

Installation

  1. Install the d3m project first.

  2. Use pip:

pip install dsbox-dataprofiling

Usage

See example.py.

You can choose the metafeatures based on what you need. The computable metafeatures include:

computable_metafeatures = ['ratio_of_values_containing_numeric_char', 'ratio_of_numeric_values', 
    'number_of_outlier_numeric_values', 'num_filename', 'number_of_tokens_containing_numeric_char', 
    'number_of_numeric_values_equal_-1', 'most_common_numeric_tokens', 'most_common_tokens', 
    'ratio_of_distinct_tokens', 'number_of_missing_values', 
    'number_of_distinct_tokens_split_by_punctuation', 'number_of_distinct_tokens', 
    'ratio_of_missing_values', 'semantic_types', 'number_of_numeric_values_equal_0', 
    'number_of_positive_numeric_values', 'most_common_alphanumeric_tokens', 
    'numeric_char_density', 'ratio_of_distinct_values', 'number_of_negative_numeric_values', 
    'target_values', 'ratio_of_tokens_split_by_punctuation_containing_numeric_char', 
    'ratio_of_values_with_leading_spaces', 'number_of_values_with_trailing_spaces', 
    'ratio_of_values_with_trailing_spaces', 'number_of_numeric_values_equal_1', 
    'natural_language_of_feature', 'most_common_punctuations', 'spearman_correlation_of_features', 
    'number_of_values_with_leading_spaces', 'ratio_of_tokens_containing_numeric_char', 
    'number_of_tokens_split_by_punctuation_containing_numeric_char', 'number_of_numeric_values', 
    'ratio_of_distinct_tokens_split_by_punctuation', 'number_of_values_containing_numeric_char', 
    'most_common_tokens_split_by_punctuation', 'number_of_distinct_values', 
    'pearson_correlation_of_features']
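
Choosing a subset amounts to picking names from this list. The snippet below is a minimal sketch of that step, not part of dsbox-dataprofiling: `select_metafeatures` is a hypothetical helper, and only a shortened copy of the list is used. It checks the requested names against the computable ones so that a misspelled name fails early. How the selection is handed to the profiler is not assumed here; example.py is the reference for that.

```python
# Hypothetical helper for illustration; not part of the dsbox-dataprofiling API.
# Shortened copy of the list above.
computable_metafeatures = [
    'ratio_of_missing_values',
    'number_of_missing_values',
    'number_of_distinct_values',
    'ratio_of_distinct_values',
    'pearson_correlation_of_features',
]


def select_metafeatures(requested, available=computable_metafeatures):
    """Return the requested names in order, without duplicates.

    Raises ValueError if any requested name is not in `available`.
    """
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError('unknown metafeatures: ' + ', '.join(unknown))
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(requested))


wanted = select_metafeatures(['ratio_of_missing_values',
                              'number_of_distinct_values'])
print(wanted)
```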

For the specific meaning and data structure of each metafeature, see this page: data_metafeatures
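
To give a rough feel for what a few of these names suggest, here is a small sketch in plain Python. It is a literal reading of four metafeature names and not the package's implementation: `profile_column` is a hypothetical function, missing values are assumed to be represented as `None`, and both ratios are assumed to be taken over the total number of values. The data_metafeatures page remains the authoritative definition.

```python
# Illustrative only: a literal reading of four metafeature names,
# not the dsbox-dataprofiling implementation.
def profile_column(values):
    """Compute simple counts and ratios for one column given as a list."""
    total = len(values)
    # Assumption: a missing value is represented as None
    present = [v for v in values if v is not None]
    missing = total - len(present)
    distinct = len(set(present))
    return {
        'number_of_missing_values': missing,
        # Assumption: ratios are relative to the total number of values
        'ratio_of_missing_values': missing / total if total else 0.0,
        'number_of_distinct_values': distinct,
        'ratio_of_distinct_values': distinct / total if total else 0.0,
    }


column = ['a', 'b', 'a', None]
profile = profile_column(column)
print(profile)
```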