how to make the balanced dataset?

eggpom commented 1 year ago

First of all, thank you so much for sharing your work; it has been very helpful. I still have a small problem and hope you can help: the balanced data file was not generated after I ran the code (cicids2017.py). How can I get the balanced data? Looking forward to your reply, and thank you again!

baixiaobaicai commented 1 year ago

Thank you to the author for this work. I am reproducing this paper and ran into the same problem. May I ask how to solve it? Thank you!

foresthao commented 3 months ago

Well, I have encountered the same problem, but I think it is easy to solve: the training data just needs to be resampled. Here is what I did. I rewrote the scale() function in ./preprocessing/cicids2017.py:

def scale(self, training_set, validation_set, testing_set):
        """Scale the features, encode the labels, and resample the training set to a fixed size per class."""
        (X_train, y_train), (X_val, y_val), (X_test, y_test) = training_set, validation_set, testing_set

        categorical_features = self.features.select_dtypes(exclude=["number"]).columns
        numeric_features = self.features.select_dtypes(exclude=[object]).columns

        preprocessor = ColumnTransformer(transformers=[
            ('categoricals', OneHotEncoder(drop='first', sparse=False, handle_unknown='error'), categorical_features),
            ('numericals', QuantileTransformer(), numeric_features)
        ])

        # Preprocess the features
        columns = numeric_features.tolist()

        X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=columns)
        X_val = pd.DataFrame(preprocessor.transform(X_val), columns=columns)
        X_test = pd.DataFrame(preprocessor.transform(X_test), columns=columns)

        # Preprocess the labels
        le = LabelEncoder()

        y_train = pd.DataFrame(le.fit_transform(y_train), columns=["label"])
        y_val = pd.DataFrame(le.transform(y_val), columns=["label"])
        y_test = pd.DataFrame(le.transform(y_test), columns=["label"])

        # Resample the training data to address class imbalance.
        # Requires `from sklearn.utils import resample` at the top of the file.
        train_data = pd.concat([X_train, y_train], axis=1)  # Combine features and labels
        resampled_data = []  # Resampled rows, one DataFrame per class
        min_samples = 20000  # Number of rows to keep for every class
        # Iterate over each class label
        for label_value in y_train["label"].unique():
            class_data = train_data[train_data["label"] == label_value]
            # Classes with fewer than min_samples rows are oversampled with replacement;
            # all other classes are downsampled without replacement.
            replace = len(class_data) < min_samples
            resampled_class_data = resample(class_data, n_samples=min_samples, random_state=123, replace=replace)
            resampled_data.append(resampled_class_data)

        # Combine the resampled data for all classes
        resampled_data_cat = pd.concat(resampled_data)
        X_train_resampled = resampled_data_cat.drop("label", axis=1)
        y_train_resampled = resampled_data_cat["label"]

        return (X_train_resampled, y_train_resampled), (X_val, y_val), (X_test, y_test)
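
For anyone who wants to try the resampling idea on its own before patching cicids2017.py, here is a minimal standalone sketch of the same per-class approach on a toy DataFrame. The helper name `balance_classes`, its parameters, and the toy data are made up for illustration and are not part of the repository. The final shuffle is an extra step that the function above does not perform.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(data, label_col="label", n_per_class=20000, seed=123):
    """Return a copy of `data` with exactly `n_per_class` rows for each label.

    Hypothetical helper for illustration; not part of the repository.
    """
    parts = []
    for _, class_data in data.groupby(label_col):
        # Sample with replacement only when the class has too few rows.
        replace = len(class_data) < n_per_class
        parts.append(
            resample(class_data, n_samples=n_per_class,
                     replace=replace, random_state=seed)
        )
    # Shuffle so rows are not grouped by class, then renumber the index.
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy data: 6 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"feature": range(8), "label": [0] * 6 + [1] * 2})
balanced = balance_classes(toy, n_per_class=4)

# Each class should now contribute the same number of rows.
print(balanced["label"].value_counts())
```

Note that oversampling with replacement only duplicates existing rows, so very small classes end up with many exact copies. It is probably safest to balance only the training split, as the function above does, and leave the validation and test splits untouched.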

othmbela / dbn-based-nids

how to make the balanced dataset? #6