scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

When fit is called garbage is coming out #343

Closed: AnuragAnalog closed 2 years ago

AnuragAnalog commented 2 years ago

Problem

Hey, I am using the CatBoost encoder from this package to encode some of my categorical variables, but I am facing a weird issue. When fit is called, I get a huge amount of output that is just the target categories concatenated together. I have uploaded everything that might help you reproduce and solve this issue.

Expected Behavior

The encoded column values from the encoder

Actual Behavior

rmalanomalynormalanomalynormalnormalanomalynormalnormalnormalnormalanomalyanomalynormalnormalnormalanomalynormalanomalynormalnormalnormalnormalanomalynormalnormalnormalanomalyanomalynormalnormalnormalnormalnormalnormalanomalynormalanomalyanomalynormalnormalanomalyanomalynormalnormalanomalynormalanomalynormalnormalnormalnormalnormalnormalnormalnormalanomalyanomalyanomalyanomalynormalanomalynormalnormalnormalnormalanomalyanomalynormalnormalnormalnormalnormalanomalyanomalynormalnormalnormalanomalynormalnormalnormalanomalynormalanomalynormalnormalnormalnormalnormalanomalynormalnormalanomalyanomalyanomalyanomalynormalnormalnormalanomalyanomalynormalanomalynormalanomalyanomalynormalanomalynormalanomalyanomalyanomalyanomalyanomalyanomalynormalanomalynormalanomalynormalnormalnormalnormalnormalnormalanomalyanomalyanomalyanomalynormalanomalynormalanomalynormalanomalynormalanomalynormalanomalyanomalyanomalynormalanomalyanomalynormalnormalanomalynormalanomalyanomalyanomalynormalanomalyanomalyanomalynormalnormalnormalnormalnormalanomalynormalanomalyanomalynormalanomalyanomalynormalnormalnormalnormalanomalynormalanomalyanomalynormalnormalnormalanomalynormalnormalanomalyanomalyanomalyanomalynormalanomalyanomalyanomalyanomalynormalnormalnormalnormalnormalnormalanomalyanomalynormalanomalyanomalyanomalynormalanomalynormalnormalnormalanomalyanomalynormalanomalyanomalyanomalynormalanomalynormalnormalnormalanomalynormalanomalynormalnormalnormalanomalyanomalyanomalyanomalyanomalynormalnormalnormalnormalanomalyanomalynormalnormalnormalnormalanomalyanomalynormalnormalanomalynormalanomalyanomalyanomalyanomalyanomalyanomalyanomalyanomalynormalnormalanomalyanomalynormalanomalyanomalynormalanomalynormalanomalyanomalynormalanomalynormalanomalyanomalyanomalyanomalynormalnormalanomalynormalanomalyanomalyanomalyanomalyanomalynormalnormalnormalnormalnormalanomalynormalnormalnormalnormalnormalnormalnormalnormalnormalnormalanomalynormalnormalnormalanomalyanomalynormalanomalynormalnormalanomalynormaln
ormalanomalyanomalynormalanomalynormalnormalnormalanomalyanomalynormalnormalanomalynormalanomalynormalnormalnormalnormalnormalanomalyanomalyanomalyanomalynormalnormalnormalanomalynormalanomalyanomalynormalnormalnormalanomalyanomalynormalanomalynormalanomalynormalnormalnormalnormalnormalanomalyanomalynormalnormalnormalnormalnormalanomalynormalnormalnormalnormalanomalyanomalynormalanomalyanomalynormalnormalnormalnormalnormalanomalynormalanomalyanomalynormalanomalyanomalynormalanomalynormalanomalynormalnormalnormalnormalnormalnormalanomalyanomalynormalanomalyanomalyanomalyanomalyanomalynormalnormalanomalynormalanomalynormalnormalanomalynormalanomalynormalanomalynormalanomalyanomalynormalanomalyanomalyanomalynormalnormalnormalnormalanomalynormalanomalynormalnormalanomalynormalanomalynormalanomalyanomalyanomalynormalnormalnormalnormalanomalyanomalyanomalynormalanomalyanomalyanomalynormalnormalnormalnormalnormalnormalanomalyanomalynormalnormalnormalanomalyanomalynormalanomalyanomalyanomalynormalnormalnormalanomalyanomalynormalanomalynormalanomalynormalanomalynormalanomalyanomalynormalanomalynormalanomalynormalanomalynormalanomalyanomalynormalanomalyanomalynormalnormalanomalyanomalyanomalynormalnormalanomalynormalnormalnormalanomalyanomalyanomalyanomalyanomalynormalnormalanomalyanomalynormalanomalynormalanomalyanomalyanomalyanomalynormalanomalynormalanomalynormalnormalnormalnormalnormalnormalnormalnormalanomalyanomalynormalanomalynormalanomalynormalanomalynormalnormalanomalynormalnormalnormalanomalynormalanomalyanomalyanomalynormalnormalanomalyanomalyanomalynormalanomalynormalnormalnormalnormalanomalyanomalynormalanomalyanomalynormalnormalnormalnormalanomalyanomalyanomalyanomalyanomalyanomalynormalanomalynormalnormalnormalnormalanomalyanomalynormalnormalnormalanomalynormalnormalnormalnormalanomalynormalnormalnormalnormalanomalynormalanomalynormalnormalanomalyanomalynormalanomalynormalanomalynormalnormalnormalanomalynormalanomalyanomalynormalanomalyanomalynormalanomalyanoma
lynormalnormalnormalnormalnormalnormalnormalanomalynormalanomalyanomalynormalnormalanomalynormalanomalyanomalyanomalynormalnormalnormalnormalanomalynormalanomalynormalanomalynormalnormalnormalanomalyanomalyanomalyanomalynormalnormalanomalyanomalynormalnormalnormalnormalanomalyanomalyanomalyanomalyanomalyanomalynormalnormalanomalyanomalynormalanomalynormalanomalynormalanomalynormalnormalnormalanomalyanomalyanomalyanomalyanomalynormalnormalnormalnormalanomalynormalnormalanomalynormalanomalynormalanomalynormalanomalyanomalyanomalyanomalynormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalnormalanomalyanomalyanomalynormalanomalynormalnormalanomalynormalnormalnormalnormalanomalynormalnormalnormalnormalnormalanomalynormalanomalynormalanomalynormalnormalanomalynormal to numeric

Steps to Reproduce the Problem

#!/usr/bin/python3

import os
import numpy as np
import pandas as pd
import category_encoders as ce

def train_encode(train, cols, target_col):
    for col in cols:
        encoder = ce.cat_boost.CatBoostEncoder(verbose=0)

        encoder.fit(X=train[col].values, y=train[target_col].values)
        train[col] = encoder.transform(train[col])

    return train

def main():
    os.system("wget -O test.csv https://raw.githubusercontent.com/AnuragAnalog/issues/main/nsl_kdd_train_ready.csv")
    train = pd.read_csv('./test.csv')

    nominal_cols = ['protocol_type', 'service', 'flag']
    binary_cols = ['land', 'logged_in', 'root_shell', 'su_attempted', 'is_host_login', 'is_guest_login']
    target_cols = ['class']
    numeric_cols = list(set(train.columns) - set(nominal_cols + binary_cols + target_cols))

    train = train_encode(train, nominal_cols, 'class')

if __name__ == '__main__':
    main()

Specifications

@bollwyvl Can you help me with this issue?

PaulWestenthanner commented 2 years ago

Hi @AnuragAnalog, you need to convert your target to a float rather than leaving it as a string. You probably want something like anomaly = 1, normal = 0.
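
For anyone landing here later, this is a rough sketch of what that suggestion could look like applied to the script above. It is not an official fix. The helper name `labels_to_float` and the 0/1 assignment are assumptions for illustration, and the sketch assumes the `class` column only ever contains the strings `normal` and `anomaly`. The helper raises on any other label instead of silently producing NaN.

```python
def labels_to_float(labels, mapping=None):
    """Map string class labels to floats; fail loudly on unmapped labels.

    The default mapping (normal -> 0.0, anomaly -> 1.0) is an assumption
    taken from the suggestion in this thread.
    """
    if mapping is None:
        mapping = {"normal": 0.0, "anomaly": 1.0}
    labels = list(labels)  # allow generators and pandas Series alike
    unknown = sorted({label for label in labels if label not in mapping})
    if unknown:
        raise ValueError(f"unmapped target labels: {unknown}")
    return [float(mapping[label]) for label in labels]


def train_encode(train, cols, target_col):
    """Encode `cols` with CatBoostEncoder against a numeric copy of the target."""
    # Imported here so the helper above stays usable without these packages.
    import pandas as pd
    import category_encoders as ce

    # Numeric target, aligned with the DataFrame index.
    y = pd.Series(labels_to_float(train[target_col]),
                  index=train.index, name=target_col)

    # One encoder for all nominal columns instead of one per column.
    encoder = ce.CatBoostEncoder(cols=cols, verbose=0)
    encoder.fit(train[cols], y)

    encoded = train.copy()
    encoded[cols] = encoder.transform(train[cols])[cols]
    return encoded
```

With this in place, `train = train_encode(train, nominal_cols, 'class')` from the original `main()` should work unchanged, since the target handed to the encoder is numeric. The original string `class` column is left untouched in the returned frame.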

AnuragAnalog commented 2 years ago

@PaulWestenthanner Thanks for the suggestion, I will try it.