b4yuan commented 1 year ago

Summary

This pull request introduces a CodeQL workflow to enhance the security analysis of this repository.

What is CodeQL

CodeQL is a static analysis tool that helps identify and mitigate security vulnerabilities. It is primarily intra-function but does provide some support for inter-function analysis. By integrating CodeQL into a GitHub Actions workflow, it can proactively identify and address potential issues before they become security threats.

For more information on CodeQL and how to interpret its results, refer to the GitHub documentation and the CodeQL documentation (https://codeql.github.com/ and https://codeql.github.com/docs/).

What this PR does

We added a new CodeQL workflow file (.github/workflows/codeql.yml) that

Runs on every pull request (functionality to run on every push to main branches is included as a comment for convenience).
Runs daily.
Excludes queries with a high false positive rate or low-severity findings.
Does not display results for git submodules, focusing only on our own codebase.

Validation

To validate the functionality of this workflow, we have run several test scans on the codebase and reviewed the results. The workflow successfully compiles the project, identifies issues, and provides actionable insights while reducing noise by excluding certain queries and third-party code.

Using the workflow results

If this pull request is merged, the CodeQL workflow will be automatically run on every push to the main branch and on every pull request to the main branch. To view the results of these code scans, follow these steps:

Under the repository name, click on the Security tab.
In the left sidebar, click Code scanning alerts.

Is this a good idea?

We are researchers at Purdue University in the USA. We are studying the potential benefits and costs of using CodeQL on open-source repositories of embedded software.

We wrote up a report of our findings so far. The TL;DR is “CodeQL outperforms the other freely-available static analysis tools, with fairly low false positive rates and lots of real defects”. You can read about the report here: https://arxiv.org/abs/2310.00205

Review of engineering hazards

License: see the license at https://github.com/github/codeql-cli-binaries/blob/main/LICENSE.md:

Here's what you may also do with the Software, but only with an Open Source Codebase and subject to the License Restrictions provisions below:

Perform analysis on the Open Source Codebase.

If the Open Source Codebase is hosted and maintained on GitHub.com, generate CodeQL databases for or during automated analysis, CI, or CD.

False positives: We find that around 20% of errors are false positives, but that these FPs are polarized and only a few rules contribute to most FPs. We find that the top rules contributing to FPs are: cpp/uninitialized-local, cpp/missing-check-scanf, cpp/suspicious-pointer-scaling, cpp/unbounded-write, cpp/constant-comparison, and cpp/inconsistent-null-check. Adding a filter to filter out certain rules that contribute to a high FP rate can be done simply in the workflow file.

b4yuan commented 1 year ago

@gek169 You responded to another PR regarding the automation of the PRs. We've only raised PRs for repositories that we've been able to fork and test CodeQL on on a repo-by-repo basis.

b4yuan commented 1 year ago

Regarding the automatically generated code, I'll reiterate that this is the default workflow that Github provides you with if you wanted to create a CodeQL Analysis workflow under GitHub actions. It has been modified to support a build script, a script that forces the workflow to fail if there are errors, and filters to remove rules or directories for scanning.

davisjam commented 1 year ago

I reiterate that this PR contains neither spam (which by definition is useless, rather than useful) nor malware.

I have replied to the concerns of @gek169 at greater length here.

tuupola commented 1 year ago

Do you have a link to a repository which uses CodeQL where I could see how the results look like?

b4yuan commented 1 year ago

@tuupola You can look under the Security tab of this repo under Code scanning alerts to look at the results of the CodeQL Analysis from this PR if you wanted to see the results on hagl.

CodeQL results are not visible to anyone but the repo maintainers.

tuupola commented 1 year ago

@b4yuan There is nothing there. Apparently this PR first has to be merged to see the output.

tuupola commented 1 year ago

While I do believe this PR has been done with good intentions let me tell how it looks for an open source maintainer.

Problem #1. A new tool is introduced without discussing first. Additionally it takes considerable amount of time and effort to actually understand what the tool does. An example output would be a minimum. Instead I am now directed to read a marketing site and documentation site.

https://codeql.github.com/ https://codeql.github.com/docs/

Maybe you could provide an example repository with successful output and failed output of a CodeQL run?

Problem #2. The organisation where the PR originates has forked 275 repositories and anonymous users (without real name) are submitting the same PR to these repositories. This behavior is suspicious and spammy.

https://github.com/IVOES

GitHub is spammy these days. Bad actors are trying to hijack libraries. You should atleast make the PRs using a non anonymous account.

davisjam commented 1 year ago

@tuupola Thanks for your feedback, we appreciate it.

Problem #1: Agree, these could be improved. Though I'm curious if you can explain more about the difference you perceive between opening an issue (for discussion) vs. raising a PR (where discussion can still occur, but in the context of the code contribution)?

Problem #2:

Anonymous account: The student's name is Brian Yuan and his GitHub account is @b4yuan. That doesn't seem any more anonymous than @tuupola?
Forked 275 repositories: Well, we studied how well CodeQL performed on many repositories and wanted to share the tool with people since we found it works pretty well. Do you think this is inappropriate? I'd love to know an alternative approach you would have found more preferable -- academics tend to study open-source software in aggregate and so we end up with hundreds of repos that might benefit from the results of ap project.

b4yuan commented 1 year ago

@tuupola To add, here is an image of our forked repo of hagl and the CodeQL results in the Code Scanning section under Security:

Regarding 'failed output'---as long as CodeQL is configured properly, there really isn't 'failed output'. Maybe 'bad output', like false positives, but there are options to dismiss an error/warning for different reasons:

tuupola commented 1 year ago

It seems I forgot to answer to this.

Anonymous account: The student's name is Brian Yuan and his GitHub account is @b4yuan. That doesn't seem any more anonymous than @tuupola?

You might know the students by name. The rest of us do not. Here are screenshots taken when this pr was opened. I think the difference is quite obvious.

Screenshot from 2023-11-07 20-47-40

vs

Screenshot from 2023-11-07 20-47-10

tuupola / hagl

Add CodeQL Workflow for Code Security Analysis #115

Summary

What is CodeQL

What this PR does

Validation

Using the workflow results

Is this a good idea?

Review of engineering hazards