munhouiani / Deep-Packet

PyTorch implementation of Deep Packet: A Novel Approach for Encrypted Traffic Classification Using Deep Learning
MIT License

under sampling #2

Closed zambery closed 4 years ago

zambery commented 4 years ago

Hi, I noticed that in your code, under-sampling is performed after the train/test split, and only on the training set. This can make the test set larger than the training set. Is this a mistake, or is there a reason for doing it this way?

munhouiani commented 4 years ago

No, I did it on purpose. In machine learning, we assume the data collected so far were generated by an unknown underlying distribution. To assess the model's generalisation capability, one common way to split the train and test sets is to preserve the label distribution in both, i.e. stratified sampling.

Under-sampling is just a model-training trick, and should only be applied to the training set.

Therefore, yes, the training set can end up smaller than the test set after under-sampling.
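The procedure described above can be sketched as follows. This is a minimal stdlib-only illustration (the helper names `stratified_split` and `undersample` are hypothetical, not taken from this repo): split while preserving label proportions, then balance classes only within the training indices, leaving the test set's distribution untouched.

```python
import random
from collections import Counter

def stratified_split(y, test_frac=0.2, seed=0):
    """Split indices while preserving label proportions in both sets."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

def undersample(indices, y, seed=0):
    """Downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_label = {}
    for i in indices:
        by_label.setdefault(y[i], []).append(i)
    n_min = min(len(v) for v in by_label.values())
    kept = []
    for idxs in by_label.values():
        kept.extend(rng.sample(idxs, n_min))
    return kept

# Imbalanced toy labels: 90 samples of class "a", 10 of class "b".
y = ["a"] * 90 + ["b"] * 10

train_idx, test_idx = stratified_split(y, test_frac=0.2)
# Under-sampling touches ONLY the training indices; the test set
# keeps the original 9:1 label ratio.
train_idx = undersample(train_idx, y)

print(Counter(y[i] for i in train_idx))  # balanced: 8 "a", 8 "b"
print(Counter(y[i] for i in test_idx))   # original ratio: 18 "a", 2 "b"
```

With this toy data the balanced training set (16 samples) is indeed smaller than the test set (20 samples), which is exactly the situation the question asks about.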