usc-isi-i2 / dsbox-cleaning

The data cleaning TA1 component of DSBox
MIT License

Unary encoder #24

Closed rpedsel closed 6 years ago

rpedsel commented 6 years ago

Preliminary version of unary encoder - may have corner cases not handled properly. Please take a look, thank you!

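import pandas as pd
from dsbox.datapreprocessing.cleaner import UnaryEncoder

# train_dataset and test_dataset are paths to CSV files
train_x = pd.read_csv(train_dataset)
test_x = pd.read_csv(test_dataset)

# Fit the encoder on the training data
ue = UnaryEncoder()
ue.set_training_data(inputs=train_x, targets=['col1','col2'])
ue.fit()
result = ue.produce(inputs=train_x)

# Copy the fitted parameters into a second encoder and apply it to the test data
p = ue.get_params()
ue2 = UnaryEncoder()
ue2.set_params(params=p)
result2 = ue2.produce(inputs=test_x)

kyao commented 6 years ago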

The logic of the code looks good to me. A few issues with respect to the latest API:

  1. UnsupervisedLearnerPrimitiveBase now takes an additional Hyperparameter value. Let's use the None value for now. We can go back and fix it later.

  2. The 'Input' and 'Output' should not be set equal to pd.DataFrame. They should be set to this DataFrame:

       from d3m_metadata.container.pandas import DataFrame
       Input = DataFrame
       Output = DataFrame

  3. This d3m DataFrame is defined to be a subtype of Sequence. So, we can replace 'inputs: Sequence[Input]' with 'inputs: Input'.
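
To make the three points concrete, here is a minimal sketch of how the class declaration might look after these changes. It does not import d3m: `DataFrame`, `UnsupervisedLearnerBaseSketch`, `EncoderParams`, and `UnaryEncoderSketch` are hypothetical stand-ins, and modeling the extra Hyperparameter value as a fourth type parameter is an assumption about the base class, not its confirmed signature.

```python
from typing import Generic, Optional, TypeVar


# Hypothetical stand-in for d3m_metadata.container.pandas.DataFrame,
# modeled as a plain list of rows so the sketch runs without d3m installed.
class DataFrame(list):
    pass


Inputs = TypeVar("Inputs")
Outputs = TypeVar("Outputs")
Params = TypeVar("Params")
Hyperparams = TypeVar("Hyperparams")


# Hypothetical stand-in for the base class; the fourth type parameter is the
# assumed "additional Hyperparameter value" from point 1.
class UnsupervisedLearnerBaseSketch(Generic[Inputs, Outputs, Params, Hyperparams]):
    def __init__(self, *, hyperparams: Hyperparams) -> None:
        self.hyperparams = hyperparams


# Point 2: alias Input/Output to the container DataFrame, not pd.DataFrame.
Input = DataFrame
Output = DataFrame


# Hypothetical parameter container, for illustration only.
class EncoderParams(dict):
    pass


# Point 1: pass None in the hyperparameter slot for now.
class UnaryEncoderSketch(
    UnsupervisedLearnerBaseSketch[Input, Output, EncoderParams, None]
):
    def __init__(self, *, hyperparams: None = None) -> None:
        super().__init__(hyperparams=hyperparams)
        self._training_inputs: Optional[Input] = None

    # Point 3: 'inputs: Input' instead of 'inputs: Sequence[Input]'.
    def set_training_data(self, *, inputs: Input) -> None:
        self._training_inputs = inputs

    def produce(self, *, inputs: Input) -> Output:
        # Placeholder body: copies the rows through without encoding them.
        return DataFrame(inputs)
```

The sketch only illustrates the type plumbing; the encoding logic itself is left out.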