munhouiani / Deep-Packet

Pytorch implementation of deep packet: a novel approach for encrypted traffic classification using deep learning
MIT License

Provided train_test_set is not correct #43

Open sudmit0802 opened 7 months ago

sudmit0802 commented 7 months ago

The blog post https://blog.munhou.com/2020/04/05/Pytorch-Implementation-of-Deep-Packet-A-Novel-Approach-For-Encrypted-Tra%EF%AC%83c-Classi%EF%AC%81cation-Using-Deep-Learning/ states: "For each of the application and traffic classification tasks, the dataset is first stratified split into train set and test set with the ratio of 80:20". However, the ratio in the dataset provided at https://drive.google.com/file/d/1EF2MYyxMOWppCUXlte8lopkytMyiuQu_/view?usp=sharing is actually 20:80, so the test set is much bigger than the train set.

RayCxggg commented 5 months ago

Hi, did you find out what is wrong? I found the dataset split code in /Deep-Packet/create_train_test_set.py:

from pyspark.sql.functions import col, lit, monotonically_increasing_id


def split_train_test(df, test_size, under_sampling_train=True):
    # add increasing id for df
    df = df.withColumn("id", monotonically_increasing_id())

    # stratified split
    fractions = (
        df.select("label")
        .distinct()
        .withColumn("fraction", lit(test_size))
        .rdd.collectAsMap()
    )
    test_id = (
        df.sampleBy("label", fractions, seed=9876)
        .select("id")
        .withColumn("is_test", lit(True))
    )

    df = df.join(test_id, how="left", on="id")

    train_df = df.filter(col("is_test").isNull()).select("feature", "label")
    test_df = df.filter(col("is_test")).select("feature", "label")

    # under sampling
    if under_sampling_train:
        # get label list with count of each label
        label_count_df = train_df.groupby("label").count().toPandas()

        # get min label count in train set for under sampling
        min_label_count = int(label_count_df["count"].min())

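        # keep at most min_label_count rows per label
        # (top_n_per_group is defined elsewhere in create_train_test_set.py)
        train_df = top_n_per_group(train_df, "label", min_label_count)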

    return train_df, test_df

But it seems correct to me.
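
To check that reading, here is a minimal pure-Python sketch of the split step only, not the repository's code. It assumes that sampleBy keeps each row independently with the probability given for its label; the label names and row counts are made up for illustration.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed=9876):
    """Send each row to the test set with probability test_size.

    This imitates sampleBy with the same fraction for every label
    (an assumption about its behaviour, not the Spark implementation).
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in labels:
        (test if rng.random() < test_size else train).append(label)
    return train, test


# Hypothetical, imbalanced label column.
labels = ["chat"] * 10_000 + ["voip"] * 50_000
train, test = stratified_split(labels, test_size=0.2)

train_counts, test_counts = Counter(train), Counter(test)
for label in sorted(train_counts):
    total = train_counts[label] + test_counts[label]
    # Share of this label that landed in the test set (should be near 0.2).
    print(label, train_counts[label], test_counts[label],
          round(test_counts[label] / total, 3))
```

With test_size=0.2, each label should end up with roughly 20% of its rows in the test set, so the split itself would give about 80:20 before any undersampling.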

pao0626 commented 3 months ago

My understanding is that only the training set is downsampled, which would explain why it ends up smaller than the test set. For details, see the 'top_n_per_group' function in 'create_train_test_set.py'.
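
A rough worked example of that effect, using made-up class counts rather than the real dataset: after an 80:20 stratified split, capping every training class at the size of the smallest one can leave the training set smaller than the untouched test set. The helper `undersample_counts` is hypothetical and only imitates what 'top_n_per_group' appears to be used for in the snippet above.

```python
def split_counts(class_counts, test_size):
    """Return per-class (train, test) row counts for an exact stratified split."""
    train, test = {}, {}
    for label, n in class_counts.items():
        test[label] = round(n * test_size)
        train[label] = n - test[label]
    return train, test


def undersample_counts(train_counts):
    """Cap every class at the smallest class count (hypothetical stand-in)."""
    smallest = min(train_counts.values())
    return {label: smallest for label in train_counts}


# Hypothetical, heavily imbalanced dataset.
class_counts = {"chat": 1_000, "email": 5_000, "voip": 100_000}
train, test = split_counts(class_counts, test_size=0.2)
balanced_train = undersample_counts(train)

# Before undersampling: 84800 train rows vs 21200 test rows (80:20).
print(sum(train.values()), sum(test.values()))
# After undersampling the train set only: 2400 train rows vs 21200 test rows.
print(sum(balanced_train.values()), sum(test.values()))
```

The more imbalanced the classes are, the more rows the training set loses, while the test set keeps its original size. Whether this fully accounts for the 20:80 ratio in the provided files would need to be checked against the actual label counts.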