usc-isi-i2 / dsbox-cleaning

The data cleaning TA1 component of DSBox
MIT License

First implementation of unfold primitive. #57

Closed RqS closed 6 years ago

RqS commented 6 years ago

First version implementation of unfold. #56

This primitive unfolds a vertically concatenated dataframe.

  1. All indices are expected to have prediction results for all pipelines. So when grouping by d3mIndex, each index has the same number of PredictedTarget values across all pipeline ids. By default, only columns with https://metadata.datadrivendiscovery.org/types/PredictedTarget are unfolded.

  2. unfold_semantic_types hyperparam is a set of semantic types that the primitive will unfold. The primitive looks for columns that contain those semantic types and unfolds them.

  3. use_pipeline_id_semantic_type hyperparam is a boolean controlling whether a semantic type is used to find the pipeline id column in the input dataframe. If true, the primitive looks for `https://metadata.datadrivendiscovery.org/types/PipelineId` to find the pipeline id column, and creates attribute columns using the header attribute{pipelineid}, e.g. `binaryClass{a3180751-33aa-4790-9e70-c79672ce1278}`. If false, it creates attribute columns using the header attribute_{0,1,2,...}, e.g. `binaryClass_0`, `binaryClass_1`. If multiple columns have the semantic type `https://metadata.datadrivendiscovery.org/types/PipelineId`, the first one is always used. Default: False. Note: May need to ask for adding `https://metadata.datadrivendiscovery.org/types/PipelineId` as a new d3m semantic type.

  4. The primitive looks for the PrimaryKey column and groups by it.

  5. If there is no PrimaryKey, or there are no columns to unfold, the original df is returned.

  6. Columns in the result df will have the same metadata as in the input df. E.g. the binaryClass_0 column will have the same metadata as the binaryClass column in the original df. The only difference is the column name.
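
The behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it models the dataframe as a plain list of dicts, selects columns by name instead of by d3m semantic type, and ignores metadata entirely. The function name `unfold`, its parameters, and the `pipelineId` column name are hypothetical, and dropping the pipeline id column from the output is an assumption.

```python
def unfold(rows, key="d3mIndex", unfold_columns=("binaryClass",),
           pipeline_id_column="pipelineId", use_pipeline_id=False):
    """Unfold vertically concatenated prediction rows into one row per key.

    rows: list of dicts, one per (key, pipeline) pair, in concatenation order.
    All names here are illustrative; the real primitive locates columns
    through semantic types in the d3m metadata, which this sketch omits.
    """
    # Items 4 and 5: without a key column or columns to unfold, return the input.
    if not rows or key not in rows[0]:
        return rows
    present = [c for c in unfold_columns if c in rows[0]]
    if not present:
        return rows

    result = {}   # key value -> unfolded row
    counts = {}   # key value -> number of pipelines seen so far
    for row in rows:
        k = row[key]
        if k not in result:
            # Keep columns that are neither unfolded nor the pipeline id
            # (dropping the pipeline id column is an assumption).
            result[k] = {c: v for c, v in row.items()
                         if c not in present and c != pipeline_id_column}
            counts[k] = 0
        for col in present:
            if use_pipeline_id:
                header = "%s{%s}" % (col, row[pipeline_id_column])
            else:
                header = "%s_%d" % (col, counts[k])
            result[k][header] = row[col]
        counts[k] += 1
    return list(result.values())


# Two pipelines (p1, p2), each predicting binaryClass for indices 0 and 1.
rows = [
    {"d3mIndex": 0, "pipelineId": "p1", "binaryClass": "P"},
    {"d3mIndex": 1, "pipelineId": "p1", "binaryClass": "N"},
    {"d3mIndex": 0, "pipelineId": "p2", "binaryClass": "N"},
    {"d3mIndex": 1, "pipelineId": "p2", "binaryClass": "N"},
]
by_position = unfold(rows)                      # binaryClass_0, binaryClass_1
by_pipeline = unfold(rows, use_pipeline_id=True)  # binaryClass{p1}, binaryClass{p2}
```

With positional headers, the column suffix is the order in which each pipeline's rows appear for a given index, so the sketch relies on item 1: every index must have a prediction from every pipeline, in the same order.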