sudmit0802 opened 7 months ago
Hi, did you find out what is wrong? I found the dataset split code in /Deep-Packet/create_train_test_set.py:
```python
from pyspark.sql.functions import col, lit, monotonically_increasing_id


def split_train_test(df, test_size, under_sampling_train=True):
    # add increasing id for df
    df = df.withColumn("id", monotonically_increasing_id())

    # stratified split: one sampling fraction (test_size) per label
    fractions = (
        df.select("label")
        .distinct()
        .withColumn("fraction", lit(test_size))
        .rdd.collectAsMap()
    )
    test_id = (
        df.sampleBy("label", fractions, seed=9876)
        .select("id")
        .withColumn("is_test", lit(True))
    )
    df = df.join(test_id, how="left", on="id")

    train_df = df.filter(col("is_test").isNull()).select("feature", "label")
    test_df = df.filter(col("is_test")).select("feature", "label")

    # under sampling
    if under_sampling_train:
        # get label list with count of each label
        label_count_df = train_df.groupby("label").count().toPandas()
        # get min label count in train set for under sampling
        min_label_count = int(label_count_df["count"].min())
        train_df = top_n_per_group(train_df, "label", min_label_count)

    return train_df, test_df
```
But it seems correct to me.
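To double-check the `sampleBy` semantics, here is a toy sketch (not the real pipeline, just synthetic data) showing that the fraction passed per label is the share that lands in the *test* set. So calling `split_train_test` with `test_size=0.2` should yield roughly an 80:20 train:test split:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy dataframe: 1000 rows with two labels.
df = spark.range(1000).withColumn("label", (col("id") % 2).cast("string"))

# Same construction as in split_train_test, with test_size = 0.2.
fractions = (
    df.select("label")
    .distinct()
    .withColumn("fraction", lit(0.2))
    .rdd.collectAsMap()
)
test = df.sampleBy("label", fractions, seed=9876)
print(test.count() / df.count())  # ~0.2: the fraction is the test share
```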
My understanding is that only the training set is downsampled. For details, see the `top_n_per_group` function in `create_train_test_set.py`, sketched below.
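I haven't pasted `top_n_per_group` here, but from its name and call site I'd expect it to look roughly like this window-function sketch (my approximation, not the actual repo code):

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, rand, row_number

def top_n_per_group(df, group_col, n):
    # Shuffle rows within each group, then keep the first n per group,
    # so every label ends up with at most min_label_count rows.
    window = Window.partitionBy(group_col).orderBy(rand(seed=9876))
    return (
        df.withColumn("row_id", row_number().over(window))
        .filter(col("row_id") <= n)
        .drop("row_id")
    )
```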
The blog post https://blog.munhou.com/2020/04/05/Pytorch-Implementation-of-Deep-Packet-A-Novel-Approach-For-Encrypted-Tra%EF%AC%83c-Classi%EF%AC%81cation-Using-Deep-Learning/ states: "For each of the application and traffic classification tasks, the dataset is first stratified split into train set and test set with the ratio of 80:20". But the dataset provided at https://drive.google.com/file/d/1EF2MYyxMOWppCUXlte8lopkytMyiuQu_/view?usp=sharing is actually split 20:80, so the test set is much bigger than the train set.
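If anyone wants to reproduce the observation, counting the rows of the downloaded splits is enough. The parquet paths below are my guess at the archive layout, so adjust them to match what you extracted:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# NOTE: these paths are an assumption about the archive layout.
train_df = spark.read.parquet("train_split/application_classification")
test_df = spark.read.parquet("test_split/application_classification")

n_train, n_test = train_df.count(), test_df.count()
print(f"train={n_train}  test={n_test}  "
      f"train fraction={n_train / (n_train + n_test):.2f}")
```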