prayas7102 / NodejsSecurify

NodejsSecurify is an advanced NPM package designed to enhance the security of Node.js applications using AI/ML models. It provides a comprehensive set of security features and analysis capabilities to identify potential vulnerabilities and enforce best practices in accordance with OWASP guidelines.
https://www.npmjs.com/package/node-js-securify
MIT License

Automating Vulnerability Detection with Naive Bayes and Weighted Pickle Models #6

Open · prayas7102 opened 1 month ago

prayas7102 commented 1 month ago

While utilizing the Naive Bayes classifier to detect brute force attacks, validate inputs, identify insecure authentication, and analyze security headers, the model is currently retrained separately for each JavaScript test file (.js, .jsx, .tsx, etc.). This process can be streamlined by generating a single weighted pickle model (or multiple, depending on the vulnerability) that is reused each time a JavaScript file is tested for vulnerabilities, improving efficiency and consistency.
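For illustration, a minimal sketch of that train-once/reuse flow, assuming the classifier is the `natural` package's `BayesClassifier` (the thread later mentions `BayesClassifier`s) and that the persisted model is a JSON file, Node's analogue of a pickle; all file names and training rows below are hypothetical:

```ts
// Train-once sketch — not the package's actual training script.
import natural from "natural";

const classifier = new natural.BayesClassifier();

// Hypothetical training rows; in practice these would come from the
// per-vulnerability CSV datasets (e.g. bruteForceDataset.csv).
classifier.addDocument("for (let i = 0; i < 1e6; i++) tryLogin(user, guesses[i]);", "vulnerable");
classifier.addDocument("await rateLimiter.consume(req.ip);", "safe");

classifier.train();

// Persist once; later scans reload this file instead of retraining.
classifier.save("bruteForceModel.json", (err) => {
  if (err) throw err;
});

// Then, inside e.g. DetectBruteForceAttack.ts:
natural.BayesClassifier.load("bruteForceModel.json", null, (err, loaded) => {
  if (err) throw err;
  console.log(loaded.classify("while (true) attemptLogin(nextPassword());"));
});
```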

Steps to be considered by the contributor:

  1. Organizing and cleaning the data.
  2. Training a model.

Files to be referred/altered for this change:

  1. DetectBruteForceAttack.ts
  2. DetectInputValidation.ts
  3. InsecureAuthentication.ts
  4. AnalyzeSecurityHeaders.ts
  5. Vulnerability.ts

Make sure the end user/developer (who downloads the NPM package) is able to smoothly run the NPM package after these changes.

Commit2Cosmos commented 4 weeks ago

Hi, following our discussion at #15, I would like to tackle this. Could you please elaborate on:

> Organizing and cleaning the data.

prayas7102 commented 4 weeks ago

> Hi, following our discussion at #15, I would like to tackle this. Could you please elaborate on:
>
> Organizing and cleaning the data.

Organizing data: I was thinking we could combine all the CSV data into one dataset for training (let me know your opinion):

[image]
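A sketch of what that combination could look like, assuming each per-vulnerability CSV shares a `code,label` layout and gains a `vulnerability` column identifying its source file (all names here are hypothetical):

```ts
import * as fs from "fs";
import * as path from "path";

// Hypothetical dataset names and layout; the repo's actual CSVs may differ.
const datasets = ["bruteForceDataset.csv", "inputValidationDataset.csv"];

const combined: string[] = ["vulnerability,code,label"];
for (const file of datasets) {
  const vulnerability = path.basename(file, ".csv");
  const rows = fs
    .readFileSync(path.join("datasets", file), "utf8")
    .split("\n")
    .slice(1); // drop each file's own header row
  for (const row of rows) {
    if (row.trim() === "") continue;
    // Naive concatenation; a real implementation should use a proper CSV
    // parser (e.g. csv-parse), since code snippets can contain commas/quotes.
    combined.push(`${vulnerability},${row}`);
  }
}
fs.writeFileSync("combinedDataset.csv", combined.join("\n") + "\n");
```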

Cleaning data: as you can see in the CSV datasets, there are rows in which the code contains stray characters, like \n, //, and empty lines. For example, see rows 24 and 29 in bruteForceDataset.csv.
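One possible cleaning helper, keeping in mind that exactly which characters to strip is still being discussed in the next comment:

```ts
// One option for the cleaning step: strip comments, literal "\n" escapes,
// and blank lines. Whether comments/line breaks should be kept at all is
// debated below, so treat these rules as provisional.
function cleanSnippet(code: string): string {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/[^\n]*/g, "")       // line comments
    .replace(/\\n/g, " ")             // literal "\n" sequences in the CSV text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
}
```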

Commit2Cosmos commented 3 weeks ago

> Organizing data: I was thinking we could combine all the CSV data into one dataset for training (let me know your opinion):

In case you mean to train just one model on all the data combined, I think there are a couple of issues associated with this.

> Cleaning data: as you can see in the CSV datasets, there are rows in which the code contains stray characters, like \n, //, and empty lines. For example, see rows 24 and 29 in bruteForceDataset.csv.

I believe it makes sense to remove redundant/excessive characters like that. Keeping comments and line breaks where they would usually appear in code might be helpful if a more complex model is used (or, e.g., bootstrapping a few BayesClassifiers; probably a separate issue), to make it more resilient to real-world "unclean" code. I'd suggest experimenting with two versions of each snippet, one with and one without these characters, to see which (or, perhaps, their combination) yields better accuracy.
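A sketch of that experiment, where `trainAndScore` is a hypothetical stand-in for whatever train/evaluate routine ends up being used:

```ts
// Hypothetical harness: train on raw vs. cleaned snippets and compare
// held-out accuracy. `trainAndScore` stands in for the real routine.
type Snippet = { code: string; label: string };

function compareVersions(
  snippets: Snippet[],
  clean: (code: string) => string,
  trainAndScore: (data: Snippet[]) => number
): void {
  const raw = snippets;
  const cleaned = snippets.map((s) => ({ ...s, code: clean(s.code) }));
  console.log("raw accuracy:     ", trainAndScore(raw));
  console.log("cleaned accuracy: ", trainAndScore(cleaned));
  // The "combination" idea from above: train on both versions at once.
  console.log("combined accuracy:", trainAndScore([...raw, ...cleaned]));
}
```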

Let me know your thoughts and how you'd like me to proceed!

Commit2Cosmos commented 3 weeks ago

Also, would it help to add the percentage of tests passed once they all complete?

prayas7102 commented 3 weeks ago

@Commit2Cosmos

> I believe it makes sense to remove redundant/excessive characters like that. Keeping comments and line breaks where they would usually appear in code might be helpful if a more complex model is used (or, e.g., bootstrapping a few BayesClassifiers; probably a separate issue), to make it more resilient to real-world "unclean" code. I'd suggest experimenting with two versions of each snippet, one with and one without these characters, to see which (or, perhaps, their combination) yields better accuracy.
>
> Let me know your thoughts and how you'd like me to proceed!

Testing this dual approach seems worthwhile. I'd say let's proceed by preparing two versions of each snippet, running the experiments, and documenting the findings on accuracy and generalization.

Coming to calculating the percentage of tests passed: you can see in the TestFolder directory that there are tests for each vulnerability, and for each vulnerability there are two files, like this for the brute force attack (vulnerable test files and invulnerable test files):

[image]

But it's not the case with other vulnerabilities, so I'll make/rename the necessary files in the meantime.

Commit2Cosmos commented 3 weeks ago

> But it's not the case with other vulnerabilities, so I'll make/rename the necessary files in the meantime.

I believe it would be useful to have a standardised naming convention for these test files (something like BruteForce-V-1.js, with V for vulnerable and N for not), which could then be used to calculate and display the percentage of correctly classified test samples. I could then use that to test this dual approach and identify which vulnerabilities are easier/harder to detect. Would you like me to rename those test files and add a line at the end of the report saying "_% of tests passed"?
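A sketch of how that convention could drive the pass-rate line; the regex and the `classify` callback are assumptions, not existing code:

```ts
// Parse names like "BruteForce-V-1.js" (V = vulnerable, N = not) and
// compare each expected label against the classifier's verdict.
// `classify` is a stand-in for the real per-file detector.
const TEST_NAME = /^([A-Za-z]+)-([VN])-\d+\.(js|jsx|tsx)$/;

function percentPassed(
  files: string[],
  classify: (file: string) => "V" | "N"
): number {
  let passed = 0;
  let total = 0;
  for (const file of files) {
    const match = TEST_NAME.exec(file);
    if (!match) continue; // skip files outside the convention
    total++;
    if (classify(file) === match[2]) passed++;
  }
  return total === 0 ? 0 : (100 * passed) / total;
}

// Appended at the end of the report:
// console.log(`${percentPassed(files, classify).toFixed(1)}% of tests passed`);
```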

prayas7102 commented 2 weeks ago

> But it's not the case with other vulnerabilities, so I'll make/rename the necessary files in the meantime.
>
> I believe it would be useful to have a standardised naming convention for these test files (something like BruteForce-V-1.js, with V for vulnerable and N for not), which could then be used to calculate and display the percentage of correctly classified test samples. I could then use that to test this dual approach and identify which vulnerabilities are easier/harder to detect. Would you like me to rename those test files and add a line at the end of the report saying "_% of tests passed"?

Sorry for the late reply. Sure, go ahead!