Open · prayas7102 opened 1 month ago
Hi, following our discussion at #15 I would like to tackle this. Could you please elaborate on organizing data and cleaning data?
Organizing data: I was thinking we could combine all the CSV data into one dataset for training (let me know your opinion).
Cleaning data: as you can see in the CSV datasets, there are rows in which the code contains noise such as literal \n sequences, // comments, and empty lines. For example, see rows 24 and 29 in bruteForceDataset.csv.
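Something like this is roughly what I had in mind for both steps (just a sketch; the datasets folder, the code column name, and the naive comment-stripping regex are placeholders, not the repo's actual schema):

```python
from pathlib import Path
import re

import pandas as pd

def clean_code(snippet: str) -> str:
    """Expand literal "\\n" escapes, strip // comments, and drop empty lines."""
    snippet = snippet.replace("\\n", "\n")              # literal \n -> real newline
    snippet = re.sub(r"//[^\n]*", "", snippet)          # naive: also hits // inside strings/URLs
    return "\n".join(line for line in snippet.splitlines() if line.strip())

frames = []
for path in Path("datasets").glob("*.csv"):             # e.g. bruteForceDataset.csv, ...
    df = pd.read_csv(path)
    df["vulnerability"] = path.stem                     # keep track of the source dataset
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined["code"] = combined["code"].astype(str).map(clean_code)
combined.to_csv("combinedDataset.csv", index=False)
```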
In case you mean to train just one model on all the data combined, I think there are a couple of issues associated with this:
- Due to class imbalance (max samples: 706, min: 20), the model would be biased towards the vulnerabilities with the most examples present.
- You'd have to switch to multi-label or multi-class classification (depending on whether a code snippet can contain multiple vulnerabilities), or just train a binary classifier without specifying which vulnerability was found exactly (just that something is wrong, though I'm not sure that's in the spirit of this project). A rough sketch of the multi-label option is below.
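For reference, this is roughly what the multi-label variant could look like (the combinedDataset.csv file, the code/vulnerability columns, and the pipeline are illustrative assumptions, not the project's current code):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

combined = pd.read_csv("combinedDataset.csv")

# Each snippet currently carries a single vulnerability label; a list per row keeps the
# door open for snippets that exhibit several vulnerabilities at once.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(combined["vulnerability"].apply(lambda v: [v]))

X = CountVectorizer().fit_transform(combined["code"].astype(str))

# One binary NB per class avoids forcing a single winner across imbalanced classes,
# but the 706-vs-20 sample gap still needs handling (resampling, class priors, ...).
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X, y)

print(dict(zip(mlb.classes_, y.sum(axis=0))))   # per-class sample counts, to eyeball the imbalance
```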
As for cleaning data: I believe it makes sense to remove redundant/excessive characters like that. Keeping comments and line breaks where they would usually appear in code might be helpful if a more complex model is used (or e.g. bootstrapping a few BayesClassifiers; probably a separate issue), to make it more resilient to real-world "unclean" code. I'd suggest experimenting with two versions of each snippet, one with and the other without these characters, to see which (or, perhaps, their combination) would yield better accuracy.
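Roughly how I'd compare the two versions (again just a sketch; the dataset path, column names, and the stripping rules are assumptions):

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def strip_noise(code: str) -> str:
    code = re.sub(r"//[^\n]*", "", code)                          # naive comment removal
    return "\n".join(l for l in code.splitlines() if l.strip())   # drop blank lines

df = pd.read_csv("combinedDataset.csv")
variants = {
    "raw": df["code"].astype(str),
    "stripped": df["code"].astype(str).map(strip_noise),
}

for name, snippets in variants.items():
    pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
    scores = cross_val_score(pipeline, snippets, df["vulnerability"], cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```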
Let me know your thoughts and how you'd like me to proceed!
Also, would it help to add the percentage of tests passed once they all complete?
@Commit2Cosmos
Testing this dual approach seems worthwhile. I'd say let's proceed by preparing two versions of each snippet, running the experiments, and documenting the findings on accuracy and generalization.
Coming to calculating the percentage of tests passed: you can see in the TestFolder directory there are tests for each vulnerability, and for each vulnerability there should be two kinds of files, as there are for brute force attack (vulnerable test files and invulnerable test files). But that's not the case for the other vulnerabilities, so I'll create / rename the necessary files in the meantime.
I believe it would be useful to have a standardised naming convention for these test files (something like BruteForce-V-1.js, with V for vulnerable and N for not), which could then be used to calculate and display the percentage of correctly classified test samples. I could then use that for testing this dual approach and for identifying which vulnerabilities are easier/harder to detect. Would you like me to rename those test files and add a line at the end of the report saying "_% of tests passed"?
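Roughly what I have in mind for that report line (a sketch assuming the renamed files; detect() here is a hypothetical stand-in for whatever classification entry point the package exposes):

```python
from pathlib import Path

def expected_vulnerable(path: Path) -> bool:
    """BruteForce-V-1.js -> True, BruteForce-N-1.js -> False."""
    return path.stem.split("-")[1].upper() == "V"

def percentage_passed(test_dir: str, detect) -> float:
    files = sorted(Path(test_dir).glob("**/*.js"))
    passed = sum(detect(f) == expected_vulnerable(f) for f in files)
    return 100.0 * passed / len(files) if files else 0.0

# e.g. appended to the report:
# print(f"{percentage_passed('TestFolder', detect):.1f}% of tests passed")
```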
Sorry for replying late. Sure, go ahead!
While utilizing the Naive Bayes classifier to detect brute force attacks, validate inputs, identify insecure authentication, and analyze security headers, the model is currently retrained for every JavaScript test file (.js, .jsx, .tsx, etc.). This process can be streamlined by generating a single weighted pickle model (or multiple, depending on the vulnerability) that can be reused each time a JavaScript file is tested for vulnerabilities, improving efficiency and consistency.
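As a rough illustration of the intended flow (not the package's actual file layout; the dataset columns and paths are assumptions), training once and reusing the pickled model could look like this:

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_and_save(dataset_csv: str, model_path: str) -> None:
    """Fit the NB pipeline once on the dataset and persist it to disk."""
    df = pd.read_csv(dataset_csv)                       # e.g. bruteForceDataset.csv
    pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
    pipeline.fit(df["code"].astype(str), df["label"])   # column names are assumptions
    with open(model_path, "wb") as f:
        pickle.dump(pipeline, f)

def scan_file(js_path: str, model_path: str):
    """Load the pre-trained model and classify one JavaScript file (no retraining per scan)."""
    with open(model_path, "rb") as f:
        pipeline = pickle.load(f)
    with open(js_path, encoding="utf-8") as f:
        code = f.read()
    return pipeline.predict([code])[0]
```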
Steps to be considered by the contributor:
Files to be referred/altered for this change:
Make sure the end user/developer (who downloads the NPM package) is able to smoothly run the NPM package after these changes.