Increased concurrency results in increased GCS "cache read hit" times on Windows. #308

fxb commented 6 years ago

I have seen this weird behavior ever since we started using distributed caching on GCS with sccache, but at first I attributed the bad "cache read hit" times to our build machines being in AWS while the distributed cache was in GCS...

Now that our build machines live in the same GCP region as the cache and the same behavior persists, I'm thinking there is an actual parallelism issue with sccache on Windows.

On our Windows build machines with 32 vCPUs, GCS "cache read hit" times are in the range of 3-4 seconds, while when I force concurrency down to 8 or 16 processes (via ninja) they drop to about 500-700 ms. gsutil perfdiag shows really good numbers, since the build machine and storage bucket are both in europe-west1, so it doesn't seem to be a network issue.
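A comparison like the one described above could be scripted roughly as follows. This is only a sketch, not the setup actually used here: the `out` build directory and the helper names are assumptions made for illustration, while `ninja -C`, `ninja -j`, `sccache --zero-stats` and `sccache --show-stats` are existing command-line options.

```python
import subprocess
import time


def build_command(jobs, build_dir="out"):
    """Return a ninja invocation capped at `jobs` parallel processes.

    `build_dir` is a hypothetical default, not a path from this thread.
    """
    return ["ninja", "-C", build_dir, "-j", str(jobs)]


def timed_build(jobs, build_dir="out"):
    """Reset sccache stats, run one build, and return (seconds, stats text)."""
    subprocess.run(["sccache", "--zero-stats"], check=True)
    start = time.perf_counter()
    subprocess.run(build_command(jobs, build_dir), check=True)
    elapsed = time.perf_counter() - start
    stats = subprocess.run(
        ["sccache", "--show-stats"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return elapsed, stats
```

For a fair comparison, the build outputs would have to be cleaned between runs (for example with `ninja -t clean`) so that every run compiles the same set of files and hits the cache the same number of times.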

This bottleneck might be caused by some global lock or other inefficiency somewhere in the whole stack (sccache, tokio, mio, Windows itself?)... I haven't really gotten around to debugging it more thoroughly, but figured I would create this ticket to see if others are experiencing the same.
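To illustrate why a global lock would produce exactly this symptom, here is a toy simulation. It does not touch sccache, tokio or mio; the `mean_latency` function, the sleep-based "cache read" and the 20 ms service time are all assumptions made for the sketch. Each simulated read takes a fixed time, and when every read has to pass through one shared lock, the average latency seen by a caller grows with the number of concurrent callers.

```python
import threading
import time


def mean_latency(workers, service_time=0.02, shared_lock=None):
    """Start `workers` simulated cache reads at once and return the mean latency.

    If `shared_lock` is given, every read is serialized behind it, which
    stands in for a hypothetical global lock somewhere in the stack.
    """
    barrier = threading.Barrier(workers)
    latencies = []

    def read_once():
        barrier.wait()  # release all readers at the same moment
        start = time.perf_counter()
        if shared_lock is not None:
            with shared_lock:
                time.sleep(service_time)  # reads run one at a time
        else:
            time.sleep(service_time)  # reads overlap freely
        latencies.append(time.perf_counter() - start)

    threads = [threading.Thread(target=read_once) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    for workers in (8, 16, 32):
        locked = mean_latency(workers, shared_lock=threading.Lock())
        free = mean_latency(workers)
        print(f"{workers:2d} workers: locked {locked:.3f}s, unlocked {free:.3f}s")
```

With the lock, the mean latency scales roughly linearly with the number of workers, while without it the latency stays near the service time. That is the same shape as the 32-process versus 8-process numbers above, although it does not prove a lock is the actual cause.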

I first suspected https://github.com/carllerche/mio/issues/337, and tried a manually patched version which sets the number of threads to 0 to let Windows choose, but that did not make a difference.

Debugging with the Windows Performance Analyzer would probably help find where the actual bottleneck lies.

sccache 0.2.7-142-g3d9f934
ninja 1.8.2
MSVC 15.7.1 using cl.exe

luser commented 6 years ago

We don't use the GCS backend currently so I don't think I can help you here, but if there's anything I can help you with regarding the sccache codebase please do ask! @cramertj implemented GCS support, but I don't know if he's using it on Windows.

cramertj commented 6 years ago

I'm not currently using it at all, and I've never tried it on Windows, sorry!

fxb commented 6 years ago

I plan to run a simple test next week to see if the S3 cache implementation can interface with GCS by leveraging its "simple migration" path (see: https://cloud.google.com/storage/docs/migrating), and then assess whether it performs better.

fxb commented 6 years ago

I've tested just using a disk cache now, to get some baseline performance numbers. It actually seems that the preprocessing alone basically kills all the benefits of having a cache at all. For a build that takes about 15 minutes without caching, with a disk cache we get a build time of about 14 minutes.

It would be interesting to see what kind of difference in build times other teams observe. Is MSVC even used for Firefox builds or did you completely switch to Clang on Windows?

luser commented 6 years ago

We had been using MSVC for Firefox builds, but we've switched entirely to clang-cl now. I don't have numbers handy, but I'm pretty sure our Windows builds are in the ballpark of 20% faster with sccache enabled. I'm sure it depends on the codebase and other things, of course. Your stats show that you're getting a high hit rate, correct? The Firefox build does do some weird things like generating "unified sources", where we put source files into groups of 8 and generate a source file that #includes all 8, so we only have to invoke the compiler once per group.
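The "unified sources" idea can be sketched in a few lines. This is not the Firefox build system's implementation; the `unified_sources` function and the `Unified_cpp_N.cpp` naming scheme are made up for illustration. It only shows the grouping step: every 8 source files get one generated file that includes them.

```python
def unified_sources(sources, group_size=8):
    """Map each generated unified file name to its contents.

    `sources` is a list of source file paths. The `Unified_cpp_N.cpp`
    names are a hypothetical naming scheme for this sketch.
    """
    unified = {}
    for index in range(0, len(sources), group_size):
        group = sources[index:index + group_size]
        name = f"Unified_cpp_{index // group_size}.cpp"
        # One #include line per original source file in the group.
        unified[name] = "".join(f'#include "{path}"\n' for path in group)
    return unified


if __name__ == "__main__":
    files = [f"src/file{n}.cpp" for n in range(17)]
    for name, contents in unified_sources(files).items():
        print(name)
        print(contents)
```

The compiler (and therefore sccache) then sees one invocation per generated file instead of one per original source, so headers shared within a group are preprocessed only once per group.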

fxb commented 6 years ago

We usually get 100% cache hits using the distributed cache, but the builds are actually not much faster. I guess it does depend on the codebase. We do use quite a lot of Boost, which I guess is preprocessor heavy. I can see that caching is much faster for the C files we compile.
