Hi,
Thanks for the above. It's not immediately clear to me what is taking more time, as it's not obvious where I can find any of the console output. Is it the tracing step or the finalization step that is taking more time, or both?
Note that with both 2.18.2 and 2.18.3 you're likely running into this issue: https://github.com/github/codeql/issues/17332. We fixed one possible cause of this (the fix for which will land in 2.19.0; to be released either this week or next), but we have not yet been able to confirm that the fix works for Chromium.
There isn't any output available for the purple builds, unfortunately, since they timed out before they had a chance to upload their logs.
However, we did have one build run to completion while running CodeQL version 2.18.2, which I think will be helpful for debugging. You can view those logs by going to https://ci.chromium.org/ui/p/chromium/builders/ci/linux-codeql-generator/2332/overview, unchecking "Hide Succeeded Steps", and clicking "CodeQL databases with logs" next to step 19, "gsutil upload (2)". (You will need some kind of Google login to download, but any Google account should do, e.g. a Gmail account or similar. Let me know if you have any issues downloading the logs and we can come up with something else.)
At a cursory glance, the logs do indicate that the "TRAP import" step (which I believe is the lion's share of the finalization step?) takes ~30 min longer on 2.18.2 than 2.18.1, which is significant, but I don't think it explains most of the difference, since an extra 30 min wouldn't lead to a doubling of the total time.
Thanks for the link.
> At a cursory glance, the logs do indicate that the "TRAP import" step (which I believe is the lion's share of the finalization step?)
That's correct.
> takes ~30 min longer on 2.18.2 than 2.18.1, which is significant, but I don't think it explains most of the difference, since an extra 30 min wouldn't lead to a doubling of the total time.
Agreed.
In general we expect a traced build to be 4x~5x slower than a regular build. From what you write it seems to be more than that; is that correct? One of the main reasons this might happen is memory pressure. The extractor can consume quite a lot of memory in certain cases (which is not something we can easily fix), which can lead to swapping. I'm not sure how much insight you have into memory usage on your build servers, but this is probably the first thing we should check. If it is a memory pressure issue, then the general advice is to either (a) reduce the parallelism of the build, or (b) switch to a machine with more memory.
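As a rough illustration of (a): assuming the build goes through `codeql database trace-command` wrapping ninja, something like the sketch below caps the build's parallelism. The database path, build target, and `-j` value are purely illustrative, not your actual configuration.

```sh
# Sketch only: cap ninja at fewer jobs than the machine has cores so that the
# extractor processes spawned per compilation have more memory headroom.
# Database path, build target, and the -j value are illustrative.
codeql database trace-command -- codeql_db \
  third_party/ninja/ninja -C out/release -j16 d8
```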
Had a prod at this out of curiosity -- it appears the actual databases have roughly doubled in size:
Build 2324 has
Running command in /b/s/w/ir/x/w/src: [/b/s/w/ir/x/w/src/third_party/ninja/ninja, -C, /b/s/w/ir/x/w/src/out/release, d8] ...
Finalizing codeql db.
Running TRAP import for CodeQL database at /b/s/w/ir/x/w/rc/codeql_dbsw08p5wmb/d8...
Grouping TRAP files by link target
Scanning TRAP files
Assembling TRAP files
Importing TRAP files
Merging relations
Finished writing database (relations: 2.03 GiB; string pool: 304.03 MiB).
Build 2332 has
Running command in /b/s/w/ir/x/w/src: [/b/s/w/ir/x/w/src/third_party/ninja/ninja, -C, /b/s/w/ir/x/w/src/out/release, d8] ...
Finalizing codeql db.
Running pre-finalize script /b/s/w/ir/x/w/codeql/cpp/tools/pre-finalize.sh in /b/s/w/ir/x/w/src.
Running command in /b/s/w/ir/x/w/src: [/b/s/w/ir/x/w/codeql/cpp/tools/pre-finalize.sh]
Running TRAP import for CodeQL database at /b/s/w/ir/x/w/rc/codeql_dbsho8v9azo/d8...
Grouping TRAP files by link target
Grouping unlinked TRAP files together
Scanning TRAP files
Importing TRAP files
Merging relations
Finished writing database (relations: 5.77 GiB; string pool: 304.03 MiB).
I note the surprising new appearance of the pre-finalize step, though this doesn't seem to have been changed recently, and for C/C++ it only exists for TRAP-caching purposes, which I believe is still beta and off by default? But crucially, the db relations have ballooned from 2.03 GiB to 5.77 GiB.
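(For reference, the size lines above can be pulled straight out of the two runs' logs with something like the following; the log file names are placeholders for however the downloaded logs end up being named.)

```sh
# Sketch: extract the final database-size lines from the two builds' logs.
# File names are placeholders.
grep "Finished writing database" build-2324.log build-2332.log
```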
> But crucially, the db relations have ballooned from 2.03 GiB to 5.77 GiB.
This is within the range of what I expect given the extractor issues that were fixed, which had quite some impact (especially since some were outright extractor crashes). However, it is potentially larger than it should be due to the issue underlying https://github.com/github/codeql/issues/17332.
> In general we expect a traced build to be 4x~5x slower than a regular build. From what you write it seems to be more than that; is that correct?
We have about a 2x increase in build time between CodeQL 2.18.1 vs 2.18.2. However, that was just the version change. I do not believe we switched from a "regular" to a traced build with the version change. (To make sure I understand correctly: by "traced build", you mean using `codeql database trace-command`, correct? That command is what we have always used.)
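(Roughly, our wrapper drives a sequence like the sketch below; the `database init` step, paths, flags, and target here are my shorthand for what the script does rather than a verbatim copy of it.)

```sh
# Sketch of the traced-build sequence; paths, flags, and target are illustrative.

# Create an empty database for the C/C++ extractor.
codeql database init --language=cpp --source-root=src codeql_db

# Run the build under the CodeQL tracer.
codeql database trace-command -- codeql_db \
  third_party/ninja/ninja -C out/release d8

# Finalize the database; this runs the TRAP import step.
codeql database finalize codeql_db
```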
> the general advice is to either (a) reduce the parallelism of the build, or (b) switch to a machine with more memory.
The machine this job was run on has 32 cores and 256 GB of memory. I can look into whether machines with more resources are available, but I'm curious (1) how this compares to machines used by projects of a similar size that you've seen (if any), and (2) whether the build time is still surprising given those specs (or whether that feels about right).
>> In general we expect a traced build to be 4x~5x slower than a regular build. From what you write it seems to be more than that; is that correct?
>
> We have about a 2x increase in build time between CodeQL 2.18.1 vs 2.18.2. However, that was just the version change. I do not believe we switched from a "regular" to a traced build with the version change. (To make sure I understand correctly: by "traced build", you mean using `codeql database trace-command`, correct? That command is what we have always used.)
That's indeed what I mean by a traced build. However, apologies for not being clear enough. What I'm interested in knowing is how the traced build time compares with the build time of your normal builds (using comparable machines). I would like to establish whether the time of the traced build, even with the newer version of CodeQL, is still within the bounds of what we think is acceptable.
>> the general advice is to either (a) reduce the parallelism of the build, or (b) switch to a machine with more memory.
>
> The machine this job was run on has 32 cores and 256 GB of memory. I can look into whether machines with more resources are available, but I'm curious (1) how this compares to machines used by projects of a similar size that you've seen (if any), and (2) whether the build time is still surprising given those specs (or whether that feels about right).
Before switching machines, it would really make sense to try to understand whether this is a memory pressure issue or whether there's something else going on. If the latter, switching machines is likely not going to help. To answer your questions:
(Sorry for the delayed reply here; was AFK for a bit.)
Ah, I understand your question now.
re: build time comparison: An untraced build of Chromium takes ~5-6 hours on one of our machines. A traced build using CodeQL version 2.18.1 takes 7-8 hrs. The traced build using CodeQL version 2.18.2 takes 14-15 hrs.
It sounds like, based on what you're saying, anything up to ~25 hrs for a traced Chromium build would be within the acceptable bounds for CodeQL; is that correct? (Is there more information I can provide re: pinning down whether this is memory pressure or something else?)
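If it helps, during the next traced run I could collect something along these lines on the builder; the sampling interval and output paths below are arbitrary, just to show the kind of data I mean.

```sh
# Sketch: sample memory and swap usage once a minute for the duration of the build.
# Interval and output paths are arbitrary choices, not our real tooling.
vmstat -t 60 > /tmp/codeql-build-vmstat.log &
free -h --seconds 60 > /tmp/codeql-build-free.log &
```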
> re: build time comparison: An untraced build of Chromium takes ~5-6 hours on one of our machines. A traced build using CodeQL version 2.18.1 takes 7-8 hrs. The traced build using CodeQL version 2.18.2 takes 14-15 hrs.
>
> It sounds like, based on what you're saying, anything up to ~25 hrs for a traced Chromium build would be within the acceptable bounds for CodeQL; is that correct?
Yes, that is correct, and since we did fix a number of fatal errors that affected a lot of files, the jump in the time the traced build takes makes sense. The extracted database should be much more complete now.
> (Is there more information I can provide re: pinning down whether this is memory pressure or something else?)
I think we're ok for now, since the traced build time is not excessive (although admittedly it takes a long time on a project this size). It would be good to keep an eye on this and let us know if you see another large jump in the traced build time.
>> But crucially, the db relations have ballooned from 2.03 GiB to 5.77 GiB.
>
> This is within the range of what I expect given the extractor issues that were fixed, which had quite some impact (especially since some were outright extractor crashes). However, it is potentially larger than it should be due to the issue underlying #17332.
We expect the size will have gone down again with CodeQL 2.19.0.
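If you want to sanity-check that after upgrading, comparing the on-disk size of the imported relations between the old and new databases should be enough; a minimal sketch, assuming the usual `db-cpp` directory layout and illustrative database paths:

```sh
# Sketch: compare the imported relation data of databases built with different
# CodeQL versions. Database paths are illustrative; db-cpp holds the relations
# for a C/C++ database.
du -sh chromium-2.18.2-db/db-cpp chromium-2.19.0-db/db-cpp
```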
Hi @flowerhack,
Thanks again for your report. Were you able to test the latest 2.19 version as suggested by @jketema?
Thanks!
This issue is stale because it has been open 14 days with no activity. Comment or remove the `Stale` label in order to avoid having this issue closed in 7 days.
This issue was closed because it has been inactive for 7 days.
Hi there,
I'm a committer for the Chromium project & we've been experimenting with building CodeQL databases of Chromium.
Recently, we upgraded the version of CodeQL we were using to compile our databases from 2.18.1 to 2.18.2, so we could include some fixes y'all have recently merged for bugs Chromium ran into (thank you!).
However, we noticed a huge spike in the time it takes to build a CodeQL database after making this switch—it increased the build time by ~2x (and in particular some of our builders started timing out).
We can increase builder timeouts if need be, but we'd like to understand if there's some more fundamental problem here (and if so, is it specific to Chromium or has it been seen on other projects, etc), or if that sharp of an increase in build time is considered a bug.
If it's useful for your analysis, you can view our builder here: https://ci.chromium.org/ui/p/chromium/builders/ci/linux-codeql-generator. The purple failing builds started occurring around 2331, which is when we made the version switch, and started working again at 2337 when we switched the version back.
You'll notice, if you expand out the steps, that the time on the `index_target.py` step is what is particularly taking longer. That step is primarily a wrapper around the commands to compile the CodeQL database; you can view the script here in case that's useful.

I see that 2.18.3 is now available; I may try updating to that version, in case this was a known bug that was fixed in the very next version, but I'm reporting this now on the assumption that that's not the case.
Thanks for any help you can offer!