munhouiani / Deep-Packet

PyTorch implementation of Deep Packet: a novel approach for encrypted traffic classification using deep learning
MIT License

About the missing data set categories #27

Closed. HERMIT-OuO closed this issue 1 year ago

HERMIT-OuO commented 2 years ago

Hi.

I am trying to use ISCXVPN2016 for data preprocessing and for splitting into training and test sets, but ISCXVPN2016 does not seem to contain a torrent01 file.

So I downloaded your processed dataset, but when I checked the label counts, I found that your dataset (category classification) is distributed as follows.

    label  count
0       0  12731
1       7  12731
2       6  12731
3       5  12731
4       1  12731
5      10  12731
6       3  12731
7       8  12731
8      11  12731
9       2  12731
10      4  12731

    label    count
0       0    23990
1       7    25344
2       6     8480
3       5   958956
4       1    13582
5      10    53498
6       3    18473
7       8     3260
8      11   179758
9       2  1236595
10      4    14258

It looks like there are only 11 categories instead of 12. Is this a mistake on my part?
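A quick way to see which class is absent is to compare the label ids in the tables above against the expected range. The following is a minimal sketch, assuming the 12 categories are numbered 0 to 11; the helper name `missing_labels` is hypothetical and not part of the repository.

```python
def missing_labels(labels, num_classes):
    """Return the class ids in range(num_classes) that never occur in labels."""
    present = set(labels)
    return [c for c in range(num_classes) if c not in present]


# Label ids copied from the tables above (counts omitted).
observed = [0, 7, 6, 5, 1, 10, 3, 8, 11, 2, 4]

# Assumption: the category task uses ids 0..11.
print(missing_labels(observed, num_classes=12))  # [9]
```

Under that assumption, label 9 is the only id with no samples. Which traffic category label 9 maps to is not stated in this thread.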

JieJayCao commented 2 years ago

Coincidentally, I downloaded the original ISCXVPN2016 dataset from the UNB CIC website and found that the P2P (torrent) PCAP file is indeed missing from the dataset, yet many papers emphasize the 16-class application classification task and the 12-class service classification task.

I don't know how to deal with this problem.

HERMIT-OuO commented 2 years ago

I have looked at many papers that use ISCXVPN2016 as a dataset, and basically all of them treat P2P as a classification category. I don't know how the authors managed this.

:)

Pau1code commented 1 year ago

I also found this problem. Have you solved this problem now?

JieJayCao commented 1 year ago

> I also found this problem. Have you solved this problem now?

You can simply remove this category and run the experiments without it; the impact should be small. Alternatively, use the processed dataset provided by the author of this repository.
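If a category is dropped, the remaining label ids are no longer contiguous, and losses such as PyTorch's `CrossEntropyLoss` expect class indices in the range 0 to C-1. A minimal sketch of dropping one label and remapping the rest, assuming samples are stored as `(features, label)` pairs; the function name `drop_and_remap` and the sample layout are hypothetical, not the repository's actual preprocessing.

```python
def drop_and_remap(samples, drop_label):
    """Remove samples with drop_label and remap remaining labels to 0..K-1.

    samples: iterable of (features, label) pairs (assumed layout).
    Returns the filtered, relabeled samples and the old->new label mapping.
    """
    kept = [(x, y) for x, y in samples if y != drop_label]
    remaining = sorted({y for _, y in kept})
    mapping = {old: new for new, old in enumerate(remaining)}
    return [(x, mapping[y]) for x, y in kept], mapping


# Toy data: placeholder strings stand in for packet features.
toy = [("pkt_a", 0), ("pkt_b", 9), ("pkt_c", 10), ("pkt_d", 11)]
relabeled, mapping = drop_and_remap(toy, drop_label=9)
print(relabeled)  # [('pkt_a', 0), ('pkt_c', 1), ('pkt_d', 2)]
print(mapping)    # {0: 0, 10: 1, 11: 2}
```

Keeping the returned mapping makes it possible to translate predictions back to the original label ids when comparing against published results.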

munhouiani commented 1 year ago

They used to include torrent01.pcap in their dataset, but they have since removed it. The full dataset was updated in 2021. I retrained the model.