hackforla / website

Hack for LA's website
https://www.hackforla.org
GNU General Public License v2.0
315 stars 752 forks source link

Update codeql.yml to exclude YAML front-matter and Liquid code #6548

Open gaylem opened 5 months ago

gaylem commented 5 months ago

Overview

Many of our Javascript and HTML code files cannot be scanned by CodeQL as-is because they contain non-JS Liquid code {% ... %} or YAML front matter --- ... ---, which cause syntax errors. We need to try and resolve these errors without removing all non-JS code.

Details

The error message "Could not process some files due to syntax errors" indicates that these "syntax errors" may prevent CodeQL from scanning the files below (see issue #5234 for details).

Screenshot: CodeQL error message ![CodeQL error message 1](https://github.com/hackforla/website/assets/106627338/a02cc3a2-6219-48d0-b375-984f5dd0a7ac)

Simply deleting the Liquid lines would break the site (and CodeQL raised those errors accordingly in testing), so an alternative, holistic solution is required.

Action Items

Testing

Resources/Instructions

Possible Solutions

Here are two possible solutions (in order of preference) to this problem. Please use your best judgment, these are only recommendations.

Option 1

This approach is preferred because it is

Create a file named exclude-patterns.ql

import javascript

from File file
where (file.getExtension() = "js" or file.getExtension() = "html")
  and not file.getCode().matches(".*\\{%.*%\\}.*") // Exclude Liquid code
  and not file.getCode().matches(".*---.*")        // Exclude YAML front matter
select file

Then modify codeql-scan-job.yml file to use the new query file for analysis. Update the queries section in the Initialize CodeQL step to include the new query file:

# On codeql-scan-job.yml file:

- name: Initialize CodeQL
     uses: github/codeql-action/init@v3
     with:
       languages: ${{ matrix.language }}
       queries: path/to/exclude-patterns.ql, security-and-quality

Option 2

Exclude liquid code and YAML front matter patterns from the CodeQL analysis within `codeql-scan-job.yml`

``` # On codeql-scan-job.yml file: - name: Perform CodeQL Analysis uses: github/codeql-action/analyze@v3 with: languages: javascript queries: security-and-quality # Exclude Liquid and YAML code within JavaScript files exclude: | path: "**/*.{js,html}" patterns: - pattern: | // Start of Liquid code {% if variable %} // Liquid code here {% endif %} // End of Liquid code - pattern: | // Start of YAML front matter --- # YAML front matter here --- // End of YAML front matter ```

github-actions[bot] commented 5 months ago

Hi @gaylem.

Please don't forget to add the proper labels to this issue. Currently, the labels for the following are missing:

NOTE: Please ignore this comment if you do not have 'write' access to this directory.

To add a label, take a look at Github's documentation here.

Also, don't forget to remove the "missing labels" afterwards. To remove a label, the process is similar to adding a label, but you select a currently added label to remove it.

After the proper labels are added, the merge team will review the issue and add a "Ready for Prioritization" label once it is ready for prioritization.

Additional Resources:

gaylem commented 5 months ago

FYI, I'm going to chat with @roslynwythe about this soon to confirm if I did it correctly.

roslynwythe commented 5 months ago

@gaylem Thank you for writing this issue and for proposing some promising solutions to enable us to exclude liquid and yaml code from the CodeQL scan.

@gaylem @ExperimentsInHonesty There is another factor we need to consider when selecting an approach for this ER; that is, we are planning to ask developers to run CodeQL locally using the "CodeQL for VS Code" extension. I have been holding off on writing an issue to configure and document use of "CodeQL for VS Code" extension, until we had a solution to the problem of excluding liquid and yaml code. Ideally in this ER we will select an approach that can be utilized in the extension as well as in the GitHub workflow. According to the documentation Setting up CodeQL in Visual Studio there are options for such customization, but the setup is a bit complex.

I would like to hear your thoughts on this idea:

  1. Let's remove mention of the third option ("adding a build step to the codeql.yml file to run the Jekyll build before the analyze step is triggered") since it would not be compatible with local testing, and it sounds like it won't be effective anyway.
  2. Let's add an Action item to setup "CodeQL for VS Code" and explore both the remaining options you describe for excluding liquid and yaml code, with the requirement that they can be utilized in the extension as well as in the GitHub workflow
gaylem commented 5 months ago

Hey @roslynwythe, I'm good with removing the third option. I don't know much about CodeQL for VS Code but if it's agreed that we want that in here as well, I'll be happy to investigate and add it.

ExperimentsInHonesty commented 5 months ago

@gaylem thank you for putting so much work into this issue. It is written very clearly! Here are a few notes for you and @roslynwythe

I am open to being wrong about the need to break these into smaller issues, and if that is the case, please explain in a comment.

gaylem commented 5 months ago

Hi @ExperimentsInHonesty, thank you for your feedback!

@roslynwythe and I met to discuss this issue today. Here's where we landed:

As for the epic (#6378), Roslyn wants to discuss it in a meeting to see what you think.

At any rate, this seems like this is a good chance to consolidate things better. I'm still getting a feel for how you organize things so I'm open to guidance and would love to discuss more in a meeting as well if there's time.

Thanks again! 😄

roslynwythe commented 5 months ago
ExperimentsInHonesty commented 4 months ago

@roslynwythe

There seems like there is content in this issue now that does not exist in the other issues. So I think this just has to wait until we can have a three-way meeting with @gaylem you and me.

github-actions[bot] commented 4 months ago

Hi @aidanwsimmons, thank you for taking up this issue! Hfla appreciates you :)

Do let fellow developers know about your:- i. Availability: (When are you available to work on the issue/answer questions other programmers might have about your issue?) ii. ETA: (When do you expect this issue to be completed?)

You're awesome!

P.S. - You may not take up another issue until this issue gets merged (or closed). Thanks again :)

aidanwsimmons commented 4 months ago

Availability: weekday afternoons ETA: 5/9/24

ExperimentsInHonesty commented 4 months ago

@aidanwsimmons I noticed the closed PR. How are things going on this issue. Do you need some help...

Please provide update

  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures or links* (if necessary): "Add any pictures or links that will help illustrate what you are working on."

p.s., I added a ready for product label to this issue, so I can check back in with you.

aidanwsimmons commented 4 months ago
  1. Progress: I created a new query file to exclude YAML front matter and Liquid code when scanning js and html files. I added the new query file to the queries list in the Initialize CodeQL step in codeql-scan-job-yml.
  2. Blockers: Running the codeql.yml workflow produces the error message: "Could not process some files due to syntax errors" for the same files mentioned in the issue body. See screenshot of logs below. Upon inspecting logs- it seems as though the query file is not correctly filtering YAML front matter or Liquid code. Ive tried adjusting my query file to account for additional white space / new lines, but this has produced the same error message for the same files.
  3. Availability: all day 5/10, afternoons 5/11 onwards
  4. ETA: I hope to find a solution by the end of Sunday, but if I cant make further progress- im happy to pass the issue off to someone else
  5. Pictures: Screenshot 2024-05-09 at 8 11 42 PM
aidanwsimmons commented 4 months ago

Update- I have been unable to implement a query file for the codeql scanning process that can exclude YAML front matter and Liquid code blocks. Unassigning myself from the issue as I am unfortunately too busy this week to dedicate the proper time to this issue. Im attaching a list of other possible methods to resolve the issue(though i have not assessed each of their viability.

Selective Scanning: Instead of trying to scan the entire file as JavaScript, you can selectively scan only the JavaScript portions. You can achieve this by programmatically separating JavaScript code blocks from Liquid or YAML blocks.

Escaping: You can use escaping mechanisms to prevent the Liquid or YAML code from being interpreted as part of the JavaScript syntax. For example, you might wrap Liquid or YAML code blocks within comments or strings to ensure they are ignored by the JavaScript parser.

Preprocessing: Before scanning the files with CodeQL, you can preprocess them to temporarily remove or comment out the Liquid or YAML code. Once the scanning is complete, you can restore the removed sections.

Custom Parsing: Develop a custom parser that can handle both JavaScript and Liquid/YAML syntax within the same file. This approach might be more complex but allows for a more seamless integration of different code types.

Plugin/Extension: Check if there are any existing plugins or extensions for CodeQL or other static analysis tools that support mixed-language parsing or can be extended to support it.

roslynwythe commented 4 months ago

@aidanwsimmons thank you for providing a list of possible solutions. I did explore the "Escaping" solution, and I saw that some tools such as ESLint do support excluding portions of code via specailly formatted JS comments, but I couldn't find anything in CodeQL documentation to indicate support for such a solution. Are you suggesting that in codeql-scan-job.yml, we could define such a specially formatted comment? Regarding the "Selective Scanning" solution you listed, are you suggesting that we break up the javascript into modules and exclude the modules that contain YAML/liquid code from scanning in codeql-scan-job.yml?

gaylem commented 4 months ago

@roslynwythe if a programmatic solution seems elusive, could we just dismiss the alerts manually?

https://docs.github.com/en/enterprise-cloud@latest/code-security/code-scanning/managing-code-scanning-alerts/managing-code-scanning-alerts-for-your-repository#dismissing-alerts

roslynwythe commented 4 months ago

@roslynwythe if a programmatic solution seems elusive, could we just dismiss the alerts manually?

https://docs.github.com/en/enterprise-cloud@latest/code-security/code-scanning/managing-code-scanning-alerts/managing-code-scanning-alerts-for-your-repository#dismissing-alerts

@gaylem It is possible to dismiss the alerts as false positives, however based on the error message "Could not process some files due to syntax errors", I'm concerned that CodeQL might stop scanning the file when it encounters non-JS code, and so we could be missing some test coverage. We've seen in some HTML files, that CodeQL will create an alert only on one instance of problematic markup and then seem to ignore a later instance of similar markup. But this is just my supposition, not the result of testing. We could create an issue for someone to test this out.

gaylem commented 4 months ago

@roslynwythe I did a little testing out of curiosity:

Liquid Code:

  1. I dismissed alert 63 (Liquid code): https://github.com/gaylem/hackforla-website/security/code-scanning/63

  2. Then added more liquid code to the file: https://github.com/gaylem/hackforla-website/blob/codeql-test/assets/js/utility/vrms-events.js

  3. I ran the scan, and it didn’t throw an alert for the new liquid code

  4. Then I added an unused global variable and it threw this error and didn’t highlight the liquid code it caught before: https://github.com/gaylem/hackforla-website/security/code-scanning/65

YAML Front Matter:

  1. I dismissed alert 13 (YAML front matter): https://github.com/gaylem/hackforla-website/security/code-scanning/13

  2. Then I added an unused global variable to this file: https://github.com/gaylem/hackforla-website/edit/codeql-test/assets/js/hamburger-nav.js

  3. I ran the codeql workflow a couple of times and it threw these alerts, but they weren’t for the variable I added. They were the same YAML front matter code.:

Conclusion: So it seems like Liquid code will be ignored and the file will still be scanned for other kinds of errors, but YAML front matter alerts will be recreated if the original alert is dismissed. In both cases, it does appear that after dismissal, the files are still being scanned.

It would probably be a good idea if one or two people tested this out, too.

roslynwythe commented 4 months ago

Thank you @gaylem for running these tests. This is very useful! I'm not able to view the links into your repository (404 error), but based on your explanation I have a question on point 3 under YAML Front Matter": If CodeQL failed to throw alerts for the unused variable in hamburger-nav.js, doesn't that indicate that it is not scanning the full code?

gaylem commented 4 months ago

@roslynwythe Step three of the YAML Front Matter section did throw errors, but only on the front matter alert code that I had previously dismissed. So it did scan the file, though I'm not sure why it didn't pick up the variable. Maybe the scan stopped at the front-matter when it was detected? Here's what I added (test2):

---
title: hamburger nav file
---

const test2 = "test" 

// Retrieve the SVG element
const svgElement = document.querySelector('#burgerImage svg');
gaylem commented 4 months ago

@roslynwythe Here are images of all the alerts since you can't see them:

Screenshot 2024-05-14 at 6 08 35 PM Screenshot 2024-05-14 at 6 08 44 PM Screenshot 2024-05-14 at 6 08 51 PM Screenshot 2024-05-14 at 6 08 57 PM
roslynwythe commented 4 months ago

@gaylem I repeated the YAML test in my repo and my results were the same. After dismissing the initial CodeQL alert on the frontmatter in hamburger-nav.js, I found:

so to me it sounds like CodeQL is not able to successfully scan code past the "error" that occurs with the frontmatter, and we shouldn't dismiss those alerts.

roslynwythe commented 3 months ago

I did have a suggestion based on a ChatGPT conversation. ChatGPT suggests that when listing queries, a "filter" type query should be listed first, before the query that scans for security and quality issues. So I would suggest switching the order in Option 1:

- name: Initialize CodeQL
     uses: github/codeql-action/init@v3
     with:
       languages: ${{ matrix.language }}
       queries: path/to/exclude-patterns.ql, security-and-quality
ChatGPT4o response

When specifying queries in a CodeQL workflow using the syntax `queries: [security-and-quality, ignoreYamlAndLiquid.ql]`, the order in which you list the queries does matter. Here's the reason: ### Execution Order - **Order of Queries**: CodeQL processes the queries in the order they are listed. This means that if you have a query that modifies or filters out certain results (like `ignoreYamlAndLiquid.ql`), it should generally be listed first. This ensures that subsequent queries (like those from `security-and-quality`) operate on the already filtered set of code. - **Chaining Effects**: If `ignoreYamlAndLiquid.ql` is designed to filter out certain lines of code before running standard security and quality checks, it should be executed before `security-and-quality`. This ensures the filtering takes place first, and the standard checks are performed only on the relevant code. ### Correct Order Given this, the correct order would be to list `ignoreYamlAndLiquid.ql` first, followed by `security-and-quality`: ```yaml queries: - ignoreYamlAndLiquid.ql - security-and-quality ``` ### Example Workflow Configuration Here’s an example of how you might specify this in a CodeQL workflow file (e.g., `.github/workflows/codeql-analysis.yml`): ```yaml name: "CodeQL" on: push: branches: [ main ] pull_request: # The branches below must be a subset of the branches above branches: [ main ] schedule: - cron: '0 0 * * 0' jobs: analyze: name: Analyze runs-on: ubuntu-latest steps: - name: Checkout repository uses: actions/checkout@v2 - name: Initialize CodeQL uses: github/codeql-action/init@v1 with: languages: javascript queries: | ./path/to/ignoreYamlAndLiquid.ql security-and-quality - name: Autobuild uses: github/codeql-action/autobuild@v1 - name: Perform CodeQL Analysis uses: github/codeql-action/analyze@v1 ``` In this configuration: 1. **Checkout repository**: Checks out the code repository. 2. **Initialize CodeQL**: Initializes the CodeQL analysis with the custom query `ignoreYamlAndLiquid.ql` followed by the standard `security-and-quality` queries. 3. **Autobuild**: Builds the project (necessary for CodeQL to understand the code structure). 4. **Perform CodeQL Analysis**: Runs the CodeQL analysis using the specified queries. By ensuring that `ignoreYamlAndLiquid.ql` runs before `security-and-quality`, you ensure that your custom filtering logic is applied before the standard security and quality checks are performed.

roslynwythe commented 3 months ago

To summarize the activity so far on this issue:

Due to the recent refactoring of CodeQL.yml, references to "CodeQL.yml" in this issue have been changed to "codeql-scan-job.yml"

github-actions[bot] commented 3 months ago

Hi @luisitocanlas, thank you for taking up this issue! Hfla appreciates you :)

Do let fellow developers know about your:- i. Availability: (When are you available to work on the issue/answer questions other programmers might have about your issue?) ii. ETA: (When do you expect this issue to be completed?)

You're awesome!

P.S. - You may not take up another issue until this issue gets merged (or closed). Thanks again :)

luisitocanlas commented 3 months ago

Updated Availability: 6/11 ~ 6/28 New ETA: EOD 6/28

Current state:

Blocker: Figuring out a way to confirm that CodeQL was able to scan the files listed above without scanning the YAML front-matter or Liquid code and without throwing the error.

What I've done so far for the blocker:

luisitocanlas commented 1 month ago

Progress update: Below are the solutions that I've tried and the errors that kept flagging. I would have to say that I did not find a solution and would be moving this issue back to the Prioritized Backlog.

Solution # 1:

Option 1 under the possible solutions that is slightly modified. ``` import javascript from File file where (file.getExtension() = "js" or file.getExtension() = "html") and not file.getCode().matches(".*\\{%-?\\s*[a-zA-Z]+.*%\\}.*") // Exclude Liquid code and not file.getCode().matches("(?s).*---.*---.*") // Exclude YAML front matter select file ```


Solution # 2:

Using predicates. ``` /** * @name Exclude YAML and Liquid Front Matter * @description Excludes YAML front matter and Liquid template sections from the analysis * @kind problem * @problem.severity warning */ import javascript /** Predicate to identify YAML front matter lines */ predicate isYamlFrontMatterLine(File f, int line) { exists ( int start, int end | start = f.getLine(1).getLineNumber() and (end = f.getLine(2).getLineNumber() or end = f.getLine(3).getLineNumber()) and line >= start and line <= end and f.getLine(start).getText().matches("---") and f.getLine(end).getText().matches("---") ) } /** Predicate to identify Liquid template sections */ predicate isLiquidTemplateLine(File f, int line) { exists ( string content | f.getLine(line).getText() = content and ( content.matches("{%.*%}") or content.matches("{{.*}}") ) ) } /** Class to represent code excluding YAML front matter and Liquid templates */ class CodeExcludingFrontMatter extends Expr { CodeExcludingFrontMatter() { this.getFile().getExtension() = "js" and not isYamlFrontMatterLine(this.getFile(), this.getLocation().getStartLine()) and not isLiquidTemplateLine(this.getFile(), this.getLocation().getStartLine()) } } from CodeExcludingFrontMatter c select c, "Code excluding YAML front matter and Liquid templates" ```


Error 1: When the first solution was implemented, this error keeps popping up. When the second solution was implemented this error didn't pop up.

Inside Perform CodeQL Analysis Log ![image](https://github.com/user-attachments/assets/7307d3b2-889e-4afd-a3a7-5cbe81bd7b25)


Error 2: For both solutions, the alerts are persistent.

Under Security Tab inside Code scanning alerts ![image](https://github.com/user-attachments/assets/f6d14792-b40f-4ee9-ae39-512d8d3ed8e8)
roslynwythe commented 3 weeks ago

@ExperimentsInHonesty @t-will-gillis Here is a summary of our options:

HackforLABot commented 3 weeks ago

Hi @duojet2ez, thank you for taking up this issue! Hfla appreciates you :)

Do let fellow developers know about your:- i. Availability: (When are you available to work on the issue/answer questions other programmers might have about your issue?) ii. ETA: (When do you expect this issue to be completed?)

You're awesome!

P.S. - You may not take up another issue until this issue gets merged (or closed). Thanks again :)

duojet2ez commented 3 weeks ago

I look forward to tackling this issue! It seems to be a challenge but also presents a good learning opportunity.

This week 8/26 I have time Mon - Thurs and then will be away labor day weekend and back the following week. My eta on this is not known at this point. I plan on reviewing the previous work on this/looking at comments.. after that I may have a better estimate.

t-will-gillis commented 2 weeks ago

Hey @duojet2ez We talked about this issue in our Monday meeting... If you look at @roslynwythe 's note above, we would like to focus most of your efforts on the last one:

we can try to modify the workflow such that CodeQL analyzes code after the Jekyll build. This article discusses such a pattern https://some-natalie.dev/blog/codeql-container-builds/ but it involves using a container job, and I'm not sure if we are comfortable with those.

duojet2ez commented 2 weeks ago

gotcha!

HackforLABot commented 1 week ago

@duojet2ez

Please add update using the below template (even if you have a pull request). Afterwards, remove the 'To Update !' label and add the 'Status: Updated' label.

  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures (optional): "Add any pictures of the visual changes made to the site so far."

If you need help, be sure to either: 1) place your issue in the Questions/In Review column of the Project Board and ask for help at your next meeting, 2) put a "Status: Help Wanted" label on your issue and pull request, or 3) put up a request for assistance on the #hfla-site channel. Please note that including your questions in the issue comments- along with screenshots, if applicable- will help us to help you. Here and here are examples of well-formed questions.

You are receiving this comment because your last comment was before Tuesday, September 3, 2024 at 12:04 AM PST.

duojet2ez commented 1 week ago
  1. Progress: Did some research to understand site architecture (jekyll and how liquid works). Read about codeQL.
  2. Blockers: Last week I didn't have the time I was hoping to work on this so didn't do much
  3. Availability: 5 - 8 hours this week
  4. ETA: Not sure because in order to solve this I need to understand codeQL and docker and the build process.. kind of complex and will take time. But I enjoy learning this stuff on a deeper level
duojet2ez commented 1 day ago

I'll be at the meeting tomorrow to discuss this issue. I would like to edit the codeql-scan-job.yml file and test but not entirely sure how to do this