data - Githubissues

byxs2016 commented 4 years ago

Can you provide malware.csv & ALL.csv? Thank you

nicolasenciso commented 4 years ago

Hi! I'm glad you're interested in my work. The dataset I used comes from this page (https://www.unb.ca/cic/datasets/url-2016.html). At the bottom of the page there is a link to download the dataset, which includes all the attack types and the benign type. If you have any questions, please let me know and I'll be glad to help.

Greetings from Colombia


byxs2016 commented 4 years ago

Thank you very much for your reply and your data. I wrote a machine-learning program that identifies malicious URLs; you enter a URL and it classifies it. The recognition results are not ideal. I would like to use your program and data to train a model and save it as a pickle file (svm.pickle, lgs.pickle, or similar). Can you help me? Thanks again.

Wan Bo

Undergraduate, School of Cyberspace Security, Beijing University of Posts and Telecommunications

Beijing


nicolasenciso commented 4 years ago

Sure. How do you want to model your feature extraction? Mine was exclusively lexical, for speed in production.
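As a rough illustration of what "exclusively lexical" feature extraction can look like, here is a minimal sketch that derives features from the URL string alone, with no network lookups. The function name `lexical_features` and the particular features are assumptions chosen for illustration; they are not the feature set of this repository or of the linked dataset.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute simple string-based features of a URL (hypothetical feature set)."""
    # urlsplit only recognizes the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # Shannon entropy of the character distribution of the whole URL.
    counts = Counter(url)
    entropy = 0.0
    for n in counts.values():
        p = n / len(url)
        entropy -= p * math.log2(p)

    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(not ch.isalnum() for ch in url),
        # Crude check for a dotted-quad IPv4 address used as the host.
        "has_ip_host": host.count(".") == 3 and host.replace(".", "").isdigit(),
        "entropy": entropy,
    }


features = lexical_features("http://192.168.0.1/login.php?id=1")
print(features)
```

Because every feature is computed from the string itself, extraction cost stays low per URL, which matches the speed motivation mentioned above.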

byxs2016 commented 4 years ago

I used logistic regression to train the model and also built an interface with Qt5. You can enter a URL and it will classify it.

Now I want to add SVM and Bayes to compare them. I'll send you the program. Can you help me improve it? Thanks.

Wan BO


byxs2016 commented 4 years ago

https://github.com/nicolasenciso/PCAhttp I want to use your project to train the model. How does that plan sound to you?

Wan BO


nicolasenciso commented 4 years ago

Hi! I recommend that you start with cross-validation. Once you have your logistic regression model, you can improve it by tuning its parameters: try different combinations of parameter values and keep the combination that gives the best detection scores. In addition, cross-validation helps guard against overfitting, because you split the data into several partitions and rotate which partitions are used for training and which for testing.
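The tuning procedure described above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal example under stated assumptions: the data is synthetic (`make_classification` stands in for real URL features), and the grid over `C` and the F1 scoring are illustrative choices, not values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and benign/malicious labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated F1:", search.best_score_)

# Final check on data that was never used during the search.
test_score = search.score(X_test, y_test)
print("held-out F1:", test_score)
```

Keeping a held-out test set separate from the search gives a less optimistic estimate than the cross-validated score of the winning parameters.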

nicolasenciso commented 4 years ago

Later, you can do the same with the other models, such as SVM and Bayes, so that you end up with the best possible model for your data. As a guide, you can use a notebook that a teacher of mine gave me, in which cross-validation is explained and implemented with scikit-learn. It is in Spanish, but I'm sure it will help you. (https://github.com/nicolasenciso/cross-validation/blob/master/svm_validacion_cruzada.ipynb).

One consideration: I recommend using a capable machine, because cross-validation runs many train/test cycles, depending on the parameter ranges you search over. With SVM, in my case, I found that the more data I used, the less time it took, so use all of it. With the polynomial kernel, be careful with the degree; it increases the training time a lot.

Finally, parallelize as much as you can. It will save you a lot of time; you'll need more RAM, but the run time is considerably shorter.

Good luck, and I'm ready to help with more questions.
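A minimal sketch of comparing several models under the same cross-validation, with the folds evaluated in parallel through scikit-learn's `n_jobs` parameter. The synthetic data, the choice of `GaussianNB` for "Bayes", and the kernel settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a feature matrix and benign/malicious labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    # A low polynomial degree keeps training time manageable.
    "svm_poly_degree2": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2)),
    "naive_bayes": GaussianNB(),
}

mean_scores = {}
for name, model in models.items():
    # n_jobs=-1 evaluates the folds in parallel on all available cores.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy", n_jobs=-1)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Each parallel worker holds its own copy of the training data, which is where the extra RAM mentioned above goes.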

byxs2016 commented 4 years ago

Thank you very much. I will improve the model according to your suggestions. I first used KFold from sklearn.

Wan
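For reference, a minimal sketch of what an explicit `KFold` loop in scikit-learn can look like, showing how the training and test partitions rotate. The data is synthetic and the model settings are assumptions; this is not the program discussed in the thread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for a feature matrix and benign/malicious labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
fold_sizes = []

for train_idx, test_idx in kfold.split(X):
    # A fresh model is trained on four folds and tested on the remaining one.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], predictions))
    fold_sizes.append(len(test_idx))

print("accuracy per fold:", fold_scores)
print("mean accuracy:", sum(fold_scores) / len(fold_scores))
```

With imbalanced classes, `StratifiedKFold` is a common alternative because it keeps the class proportions similar in every fold.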


byxs2016 commented 4 years ago

In the PCAhttp project, the data are labeled. Should a supervised learning algorithm be used, such as LDA? Why can PCA be used? Looking forward to your reply.
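As general background on this question (not the repository author's answer): PCA is unsupervised and ignores the labels, but it can still serve as a dimensionality-reduction step in front of a supervised classifier. LDA uses the labels and yields at most one fewer component than the number of classes. The sketch below compares both as preprocessing steps on synthetic data; the component counts and the classifier are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for labeled traffic or URL features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# PCA never sees y: it keeps the directions of largest variance.
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=3), LogisticRegression(max_iter=1000)
)

# LDA uses y: with two classes it can produce at most one component.
lda_model = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(max_iter=1000),
)

pca_accuracy = cross_val_score(pca_model, X, y, cv=5).mean()
lda_accuracy = cross_val_score(lda_model, X, y, cv=5).mean()
print(f"PCA + logistic regression: {pca_accuracy:.3f}")
print(f"LDA + logistic regression: {lda_accuracy:.3f}")
```

Which reduction works better depends on the data, so comparing both under the same cross-validation is a reasonable way to decide.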


nicolasenciso / MaliciousURLsDetection

data #1