munhouiani / Deep-Packet

PyTorch implementation of Deep Packet: A Novel Approach for Encrypted Traffic Classification Using Deep Learning
MIT License
183 stars 56 forks

Balance the train and test sets #24

Closed: dimitrov89 closed this issue 2 years ago

dimitrov89 commented 2 years ago

If we have almost the same number of packets for every label, can we skip the undersampling? The question is how to get almost the same number of packets not only in the train set, but also in the test set?

munhouiani commented 2 years ago

If we have almost the same number of packets for every label, can we skip the undersampling?

Yes

The question is how to get almost the same number of packets not only in the train set, but also in the test set?

I need more context on why you want to do this.

dimitrov89 commented 2 years ago

The question is how to get almost the same number of packets not only in the train set, but also in the test set?

I need more context on why you want to do this.

I have an unbalanced dataset.

/application_classification/train.parquet
label count
  16761
  16761
  ...
/application_classification/test.parquet
label count
  57476
  4232 

I now have around 15 labels (my own dataset, for a different application), and the test set is very unbalanced: counts range from about 4k to 57k packets per label. I suppose an evaluation done this way is not accurate.
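For reference, the kind of balancing discussed in this thread can be sketched with pandas. This is a minimal sketch, not the repo's actual preprocessing code: the toy DataFrame stands in for the real parquet split, and the path in the comment is assumed.

```python
import pandas as pd

# Toy stand-in for the train split; in practice it would be loaded with
# pd.read_parquet("application_classification/train.parquet") (path assumed).
train = pd.DataFrame({"label": ["chat"] * 50 + ["mail"] * 10 + ["video"] * 25})

# Inspect the per-label distribution.
counts = train["label"].value_counts()
print(counts)  # chat 50, video 25, mail 10 -> skewed

# Undersample each label down to the size of the rarest one,
# so every label ends up with the same number of rows.
min_count = counts.min()
balanced = (
    train.groupby("label", group_keys=False)
    .sample(n=min_count, random_state=0)
)
print(balanced["label"].value_counts())  # 10 rows per label
```

If the per-label counts are already roughly equal, the undersampling step can simply be skipped, as answered above.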

munhouiani commented 2 years ago

I presume the distribution of your test set is similar to that of your actual environment, so I would suggest keeping the exact distribution of the test set.

You can get the evaluation result for each individual label after the model is trained. For example, to compute the precision/recall of label 1, treat the data with any other label as the "negative samples" and the data with label 1 as the "positive samples"; the precision/recall under this setting is the score for label 1. Repeat this for all labels, and you will know how your model performs on each one.
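The per-label, one-vs-rest evaluation described above is exactly what scikit-learn's per-class metrics compute. A minimal sketch, with hypothetical labels in place of real model predictions:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical predictions; in practice y_true/y_pred come from running the
# trained model over the (unbalanced) test set.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

# One row per label: each label's precision/recall is computed one-vs-rest,
# i.e. that label as "positive" and every other label as "negative".
print(classification_report(y_true, y_pred, digits=3))

# The raw per-label numbers, in label order:
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)
```

Because each label is scored against its own support, this works even when the test set is heavily skewed; the per-label recall in particular is unaffected by the other labels' sizes.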