Support non-standard compilation processes

mvanotti commented 4 years ago

In a recent talk with @adityasharad, he mentioned that CodeQL tries to detect when a compiler is being invoked. Some projects use goma to speed up the build process, reusing previously built artifacts.

CodeQL seems to ignore all the artifacts that are obtained via goma.

p0 commented 4 years ago

To process C++ code, we indeed need to know how to compile it. The easiest way of setting that up is codeql's default, but that indeed relies on observing compilations locally, and distributed build systems like goma or bazel will not work with that approach without disabling the "distributed" aspect.

One possible approach is to leverage compile_commands.json files, as generated by many build systems. The information in them is sufficient to drive the CodeQL tooling, but obtaining such a file is a non-standard, build-system-specific, and sometimes project-specific process. One possibility would be to add support in CodeQL for creating a database based on such compiler settings files. [It's worth noting that this would not be equivalent to tracing a full local build, since CodeQL takes advantage of information from linker invocations too, and those are not represented in compilation command databases.]
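For readers unfamiliar with the file format, here is a minimal sketch of what a compile_commands.json file carries: one entry per translation unit, with a working directory, a source file, and the compiler invocation as either a "command" string or an "arguments" list. The function name `parse_compile_commands` and the example paths are hypothetical, and this only reads the file; it is not a CodeQL feature and, as noted above, linker invocations are absent from this data.

```python
import json
import shlex


def parse_compile_commands(text):
    """Parse the text of a compile_commands.json file.

    Returns a list of (directory, source_file, argv) tuples, one per entry.
    """
    units = []
    for entry in json.loads(text):
        # Each entry carries either an "arguments" list or a "command" string.
        if "arguments" in entry:
            argv = list(entry["arguments"])
        else:
            argv = shlex.split(entry["command"])
        units.append((entry["directory"], entry["file"], argv))
    return units


# Hypothetical two-entry database, one entry in each of the two allowed forms.
EXAMPLE = """
[
  {"directory": "/src/build", "file": "../a.cc",
   "command": "clang++ -I../include -c ../a.cc -o a.o"},
  {"directory": "/src/build", "file": "../b.cc",
   "arguments": ["clang++", "-DNDEBUG", "-c", "../b.cc", "-o", "b.o"]}
]
"""

for directory, source, argv in parse_compile_commands(EXAMPLE):
    print(directory, source, argv)
```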

mvanotti commented 4 years ago

Maybe it would be good to have a way of determining what is needed for CodeQL to work properly, and then each build system could figure out how to export that data somehow. That way, having extractors for different build systems depends on the community.

Manouchehri commented 4 years ago

Using bear + https://github.com/github/codeql-cli-binaries/issues/9 would be a decent solution in my opinion. =)

p0 commented 4 years ago

I would expect CodeQL's built-in support to be able to handle any situation where bear would also work. The problem with goma or bazel is that the compilations end up being done on a server process or a different machine, and are invisible to whatever local monitoring of the build process you are attempting.

haxmeadroom commented 3 years ago

Has CodeQL been tested to work with bazel by disabling the distributed aspect, or is this hypothetical? What changes were made to bazel to accomplish this?


adityasharad commented 3 years ago

For Bazel, one approach that we have used successfully is the following. It can be passed as the build command to codeql database create or used as a run shell step within a GitHub Actions workflow for CodeQL code scanning.

bazel shutdown; bazel build --spawn_strategy=local --nouse_action_cache //path/to/build/targets/...

shutdown stops all locally-running Bazel servers
--spawn_strategy=local disables the distributed aspect
--nouse_action_cache disables the action cache, increasing the likelihood that all your code is recompiled during the build
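As a rough illustration of how the command and flags above could be wired into codeql database create, the sketch below assembles the two steps as argument lists. The helper name `bazel_codeql_steps` and the database directory are hypothetical. It runs bazel shutdown as a separate step rather than assuming that --command accepts a compound `a; b` shell string.

```python
import shlex


def bazel_codeql_steps(database_dir, targets, language="cpp"):
    """Return the commands to run, in order, as argv lists."""
    # The Bazel invocation from the comment above: local spawns, no action cache.
    build_argv = [
        "bazel", "build",
        "--spawn_strategy=local",
        "--nouse_action_cache",
        targets,
    ]
    return [
        # Stop running Bazel servers so the traced build starts a fresh one.
        ["bazel", "shutdown"],
        # Let CodeQL run (and observe) the build command itself.
        ["codeql", "database", "create", database_dir,
         f"--language={language}",
         f"--command={shlex.join(build_argv)}"],
    ]


# "my-database" is a placeholder directory name.
for step in bazel_codeql_steps("my-database", "//path/to/build/targets/..."):
    print(shlex.join(step))
```

Each returned list can be passed to `subprocess.run(step, check=True)` on a machine where bazel and codeql are on the PATH.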

More involved integration into Bazel's dependency graph is possible but unlikely to be needed for the majority of use cases. Please try this and let us know if it helps.

haxmeadroom commented 3 years ago

The above approach did not work for me. I've also tried many combinations with --spawn_strategy=local, --nouse_action_cache, --batch, --action_env=LD_PRELOAD=...lib64trace.so, etc. I have fuzzgoat, which builds outside of bazel and gets 12 results (in the csv file output). If I build inside bazel, it runs the 142 evaluations but returns no rows in the csv file. I also always get a warning from bazel that LD_PRELOAD is being ignored. My impression is that LD_PRELOAD is required for this to work, right? Any ideas?

adityasharad commented 3 years ago

@haxmeadroom could you share more about the project you're building (link if it's open source) and the build commands you're using with and without Bazel?

pestophagous commented 3 years ago

I discovered CodeQL this weekend (while digging around in GitHub repo settings looking for other things).

I enabled it to see what would happen, but beyond that I have put essentially zero extra time into reconfiguring my build or trying to get CodeQL to work better on my repo.

Relevant to this bug/enhancement ticket:

My build toolchain is qmake (for now), and it seems that any cc/cpp files built in my qmake build are not being scanned.

My build also uses a submodule pointing to a different project that uses CMake, and when my build first compiles that project (which is a dependency of my app), then those files built with CMake do appear to get scanned. I know this because there are 3 warnings from the submodule codebase.

I'm actually quite pleased to see the scan including the submodule code. (After all, any vulnerabilities in the submodule will become "my" vulns after I link to that library.)

Now I just need the scan to include my code, too!

Sometime (on the weekends, for my weekend-only side project), I am willing to tweak my build script to help the scanning work.

QUESTION:

Where in the Analysis results (in GitHub web UI) or in the CI/Action log can I see a list of all the cc/cpp files that are scanned?

There must be a list (?), so I don't need to keep injecting bad code into files to see if a warning appears.

Here is the PR where I investigated that my own code does not trigger CodeQL warnings: https://github.com/pestophagous/heory/pull/46

I injected the same "Multiplication result converted to larger type" issue into my code to match the issue that I saw trigger a warning in the submodule. But the scan result says "No new or fixed alerts".


adityasharad commented 3 years ago

@pestophagous you can see a brief summary of the lines of code seen by CodeQL within your Actions logs here: https://github.com/pestophagous/heory/runs/2765497253?check_suite_focus=true#step:5:257 (Analysis summary for <language>).

We're in the process of rolling out some new features that give you additional diagnostic information about the codebase that was analysed, such as the number of files (or the list of files when running with higher verbosity). Will report back when you can try those out.

pestophagous commented 3 years ago

@adityasharad This is all I see when I follow the link to the "brief summary" that you mention:

Analysis produced the following metric data:

|                  Metric                   | Value  |
+-------------------------------------------+--------+
| Total lines of C/C++ code in the database | 775466 |
##[endgroup]
##[group]Analysis summary for cpp
Counted 605060 lines of code for cpp as a baseline.
Analysis produced the following metric data:

|                  Metric                   | Value  |
+-------------------------------------------+--------+
| Total lines of C/C++ code in the database | 775466 |

That provides a "no" answer to "can I see a list of all the cc/cpp files?"

Right? I'm not mad if the answer is "no". I just want to clearly understand if it is yes or no to make sure I didn't misunderstand or follow an incorrect link.

It's great to hear you are working on additional features! I look forward to it. I contributed to this ticket only in the spirit of "giving back" and providing more real-life test cases for the team. I'm not complaining! (How could I, this is all provided free of cost to me!)

Thanks for your reply and interest.

mvanotti commented 3 years ago

@pestophagous, I created issue #13 to track what you are asking for. There's a query that will give you the list of files that are in the database.
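Separately from the query mentioned above, one rough way to see which files ended up in a database is to look at its source archive. The sketch below assumes the database directory contains a src.zip holding the extracted sources, which is worth verifying for your CodeQL version. The function name `archived_sources` and the suffix list are hypothetical choices.

```python
import zipfile

# Assumed set of C/C++ file suffixes to report; adjust for your project.
CPP_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".h", ".hpp")


def archived_sources(src_zip, suffixes=CPP_SUFFIXES):
    """List archive members whose names end with one of the given suffixes.

    src_zip may be a path (for example "<database>/src.zip") or a file object.
    """
    with zipfile.ZipFile(src_zip) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(suffixes)
        )
```

If your own .cc or .cpp files are missing from the result, the extractor most likely never observed their compilation.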

mvanotti commented 2 years ago

I have seen that there's a new "Indirect Tracing Mode" for building CodeQL databases. Is this the recommended way to build databases for other build environments (for example, GOMA or RBE)?

Would it be possible to just have something that parses compile_commands.json and emits the environment variables that the CodeQL CLI needs?

mvanotti commented 2 years ago

Ah, my bad, the Indirect Build Tracing still tries to figure out what the extractors are. But it seems like it should detect gomacc, right?

adityasharad commented 2 years ago

:wave: For goma (as I understand it) the main requirement is that you disable the distributed aspect of the build. If the build is constrained to the local machine, then either a direct command line passed to codeql database create or a sequence of build steps wrapped by CodeQL's indirect build tracing will work. Neither of those features is designed to force the build to run locally, so you must configure your build to do so.
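To make the indirect build tracing option concrete, here is a minimal sketch of the sequence as I understand it from the CodeQL CLI documentation: initialise the database with --begin-tracing, run the build steps in an environment set up by the generated script, then finalize. The helper name `indirect_tracing_plan` is hypothetical, and the script location under temp/tracingEnvironment is an assumption to check against your CLI version.

```python
import posixpath


def indirect_tracing_plan(database_dir, source_root, language="cpp"):
    """Return (init_argv, env_script, finalize_argv) for an indirectly traced build."""
    init_argv = [
        "codeql", "database", "init",
        f"--language={language}",
        f"--source-root={source_root}",
        "--begin-tracing",
        database_dir,
    ]
    # Assumed location of the script that exports the tracing environment;
    # it would be sourced in the shell that runs the (local) build steps.
    env_script = posixpath.join(
        database_dir, "temp", "tracingEnvironment", "start-tracing.sh"
    )
    finalize_argv = ["codeql", "database", "finalize", database_dir]
    return init_argv, env_script, finalize_argv


init_argv, env_script, finalize_argv = indirect_tracing_plan("db", ".")
print(init_argv)
print(env_script)
print(finalize_argv)
```

Nothing in this sequence forces compilations to run locally; that still has to be configured in the build system itself.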

compile_commands.json support is something we see the need for and are discussing at the moment, with the same caveats that p0 described earlier in this issue. Will keep you updated if this makes it onto our roadmap.

mvanotti commented 2 years ago

Hi @adityasharad !

I thought the CodeQL CLI only needed to look up the compiler invocations of the commands. My understanding is that when compiling with goma, you just use gomacc to build, instead of your regular compiler. That's why I thought it would be somewhat doable to trace.

AIUI, RBE (Remote Build Execution) uses a similar thing, but uses a different prefix (no gomacc).

So I am wondering what we would need to get those recognized by the CodeQL CLI extractors.
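The prefix idea can be illustrated with a small sketch: given an observed command line, drop leading wrapper programs such as gomacc to recover the underlying compiler invocation. This only illustrates the question being asked. It says nothing about how CodeQL's tracer actually identifies compilers, and the `KNOWN_WRAPPERS` set and the function name are hypothetical.

```python
import os

# Hypothetical list of wrapper executables that prefix the real compiler.
KNOWN_WRAPPERS = frozenset({"gomacc"})


def unwrap_compiler(argv, wrappers=KNOWN_WRAPPERS):
    """Strip leading wrapper programs, keeping at least one argument."""
    start = 0
    while start < len(argv) - 1 and os.path.basename(argv[start]) in wrappers:
        start += 1
    return argv[start:]


print(unwrap_compiler(["/opt/goma/gomacc", "clang++", "-c", "a.cc"]))
```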

dmivankov commented 2 years ago

Another option for bazel with strict action_env is to add the following to bazelrc:

# CodeQL build mode
# some vars are defined in https://github.com/github/codeql-action/blob/d7ad71d8034d228d5c8076dc7f058905e272a3fd/src/tracer-config.ts

# CodeQL needs to trace compiler via LD_PRELOAD + some other vars
build:codeql --action_env LD_PRELOAD --action_env ODASA_TRACER_CONFIGURATION --action_env SEMMLE_EXECP --action_env SEMMLE_JAVA_TOOL_OPTIONS --action_env SEMMLE_PRELOAD_libtrace --action_env SEMMLE_PRELOAD_libtrace32 --action_env SEMMLE_PRELOAD_libtrace64 --action_env SEMMLE_COPY_EXECUTABLES_ROOT

# CodeQL needs to compile everything locally and without cache
build:codeql --noremote_accept_cached --remote_upload_local_results=false --spawn_strategy=local

# Pass along CODEQL_* env vars
build:codeql --action_env CODEQL_EXEC_ARGS_OFFSET --action_env CODEQL_EXTRACTOR_JAVA_LOG_DIR --action_env CODEQL_EXTRACTOR_JAVA_RAM --action_env CODEQL_EXTRACTOR_JAVA_ROOT --action_env CODEQL_EXTRACTOR_JAVA_SOURCE_ARCHIVE_DIR --action_env CODEQL_EXTRACTOR_JAVA_THREADS --action_env CODEQL_EXTRACTOR_JAVA_TRAP_DIR --action_env CODEQL_EXTRACTOR_JAVA_WIP_DATABASE --action_env CODEQL_JAVA_HOME --action_env CODEQL_PARENT_ID --action_env CODEQL_PLATFORM --action_env CODEQL_PLATFORM_DLL_EXTENSION --action_env CODEQL_RAM --action_env CODEQL_SCRATCH_DIR --action_env CODEQL_THREADS --action_env CODEQL_DIST --action_env CODEQL_TRACER_LOG

and then use bazel build --config codeql as the build command

update: above works for java 11 code under bazel 4, java 17 with bazel 5 but not for java 11 with bazel 5
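The long --action_env lines in the bazelrc fragment are easy to mistype, so one option is to generate them from a list of variable names. The sketch below is only a convenience. The helper name `action_env_line` is hypothetical, and the names shown are a subset copied from the fragment. It relies on Bazel's behaviour that --action_env NAME without a value passes through the value from the invoking environment.

```python
def action_env_line(config, names):
    """Build one bazelrc line that forwards each named variable to actions."""
    flags = " ".join(f"--action_env {name}" for name in names)
    return f"build:{config} {flags}"


# A few of the variable names from the fragment above.
TRACER_VARS = ["LD_PRELOAD", "ODASA_TRACER_CONFIGURATION", "SEMMLE_EXECP"]

print(action_env_line("codeql", TRACER_VARS))
```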

j2kun commented 2 months ago

Is anyone aware of a method like @dmivankov's java approach that would suffice for bazel+cpp? I have a compile_commands.json generator available.

adityasharad commented 2 months ago

@j2kun could you first try adapting the Bazel example at https://docs.github.com/en/code-security/codeql-cli/getting-started-with-the-codeql-cli/preparing-your-code-for-codeql-analysis#specifying-build-commands (scroll down from that link and look for "Project built using Bazel"). That approach has worked well for us in many scenarios internally and externally; if it doesn't work for your use case please file a separate issue with the details for us. Thanks!

github / codeql-cli-binaries

Support non-standard compilation processes #12