nextest-rs / nextest

A next-generation test runner for Rust.
https://nexte.st
Apache License 2.0
2.19k stars 97 forks source link

Performance regression on Mac with M1 (caused by parentalcontrolsd directory scans) #262

Open johamster opened 2 years ago

johamster commented 2 years ago

Hi,

first of all thanks for building nextest it is great (except for what follows here).

We observed a performance regression in our test execution times from approx. 20-30s to >200s. So far, we don't have an explanation. We observed it yesterday but it could have gone unnoticed for a few days.

It only happens on Mac with M1 (pro) chips. Linux on x86 machines (e.g., our CI) is not affected.

So far we tried the following without success:

Further, we observed that nextest processes are spawned as expected. However we observed the following:

Any ideas? We are happy to help.

Kind regards, Johannes

sunshowers commented 2 years ago

Thanks for the report!

Based on what you describe my first thought is Gatekeeper or some other sort of macOS process creation bottleneck. It might be worth looking at Activity Monitor or top to see if there's some system process that is consuming CPU.

https://nexte.st/book/antivirus-gatekeeper.html#macos

sunshowers commented 2 years ago

Ahhh I just saw the note about parentalcontrolsd. Yes, sadly, I've seen it be responsible for high CPU usage in other contexts though I've never been able to reproduce it.

You may try following the commands mentioned at: https://github.com/MohammedFadin/ParentalControlSD_Fixer/blob/master/app.js

sunshowers commented 2 years ago

Would you be ok making a spindump of parentalcontrolsd and sending it to my email rain@sunshowers.io? There's an Apple employee that's willing to look: https://twitter.com/Catfish_Man/status/1534577710172778497

Here's how to create a spindump (create one for parentalcontrolsd in this case): https://support.zoom.us/hc/en-us/articles/206435255-Creating-a-Spindump-file-on-macOS

Jonas-Heinrich commented 2 years ago

Hi @sunshowers,

I work on the same team as Johannes and am also affected. I executed the gatekeeper commands you mentioned above, but it unfortunately does not help.

I also created a spindump of parentalcontrolsd and might be able to share it with an Apple employee directly. Since it is related to an internal project though, there would be some bureaucracy involved..

Instead I created this repository which reproduces the issue in isolation and was also run on x86-64 Linux. The main summary:

cargo test cargo nextest run
Darwin real 1.056s real 6.790s
Linux real 1.207s real 0.449s

So if one has cargo and cargo-nextest installed this allows for a deeper investigation anyway. An additional disclaimer: we are experiencing this on a corporate image of MacOS, so perhaps a custom IT policy is interfering. We would appreciate it very much if someone could tell us if they were able to reproduce this on their M1 machine.

Kind regards, Jonas

cc: @Catfish-Man

sunshowers commented 2 years ago

Thanks for the minimized test case! I do think this is generally an issue with macOS process execution performance. It could definitely be a custom IT policy that e.g. deployed an endpoint security extension: https://developer.apple.com/documentation/endpointsecurity

Most such extensions will intercept each process execution. If they don't respond quickly enough, nextest will slow down.

I don't have a suitable Mac computer to test with but perhaps you or someone else does?

quadruple-output commented 2 years ago

I work on the same team as Johannes and Jonas :)

I did some experimenting, and I am fairly confident that the issue is caused by parentalcontrolsd, or some kind of extension called by this process (but I think I can rule out endpoint security extensions because I deinstalled them - see below).

This is what I tried:

I started our test suite in a terminal and ran sysdiagnose -p <PID-of-parentalcontrolsd in another terminal while the tests were running. This created a .tar.gz file with 500Mb of log data. I greped for parentalcontrolsd and found many hits in a file named fs_usage.txt. From this file I extracted all lines with string parentalcontrolsd into a separate file and inspected the result.

I found 48.6k calls to getdirentries64 within a time span of 11 seconds. I cannot tell whether this really is "a lot", but it made me suspicious. The calls to getdirentries64 where mostly related to the same folder which happens to be the folder where the compiled test programs are located: <rust-project>/target/debug/deps.

When I checked, this folder contained more than 5.6k files, mostly .o and .d files. After I deleted these files (rm *.o *.d) and re-ran the test suite, it was much faster. For the test repository provided by Jonas, folder target/debug/deps only contains 382 files, but the performance gain from deleting all .o and .d files is still significant: The run time of cargo nextest run goes down from 4.7 seconds to 1.4 seconds on my machine (M1 Pro 2021).

Hypothesis:

I suspect that the performance regression is caused by parentalcontrolsd. It seems that parentalcontrolsd scans the containing folder of any program which is started. For folders containing many files, this has a significant performance impact.

When I execute systemextensionsctl list on my corporate laptop, I see one entry with bundleID "com.jamf.protect.security-extension". I can temporarily deinstall Jamf Protect, which removes the security-extension, and even without the security-extension, the observations I described above are the same: reducing the number of files in the test programs' foler significantly reduces the runtime of the test suite. I do not know if parentalcontrolsd may trigger other kinds of extensions, but if this is not the case, parentalcontrolsd itself seems to be the culprit.

sunshowers commented 2 years ago

Thank you so much for the investigation @quadruple-output! It does look like an issue with Apple's software at this point. This sort of linear scan through a directory each time a program is started is bound to cause performance issues.

It's possible that thanks to your investigation this can be fixed upstream, with Apple, so thanks again.

sunshowers commented 2 years ago

I've forwarded the report over, thanks!

One way nextest can work around this is to do a simplified version of reusing builds:

  1. Have a separate option under the REUSE BUILD OPTIONS called e.g. --hardlink-binaries (make it incompatible with the other options in this section if it aids in simplicity)
  2. After the build is done, rather than archiving binaries, hardlink them into a separate location under target/nextest (virtually always guaranteed to be on the same filesystem).
  3. Reuse the target-dir-remap feature to point test runs towards this other directory.

We'd welcome contributions if you'd like to implement this -- it is probably 1 day worth of work including integration tests.

quadruple-output commented 2 years ago

We have news that may be relevant for the report you sent upstream…

We learned from our IT department that they recently enabled a new mobile device management (MDM) configuration profile (I hope I am using the correct terms here). The profile contains a specific list of application paths that are not allowed to be launched on our managed endpoints. As far as I understand, this is achieved by leveraging a standard MDM feature, though this feature is marked as "deprecated".

Our IT colleagues would like to open an Apple ticket on their own. It would be great if you could share the ID of yours, so we can link the issues.

sunshowers commented 2 years ago

@quadruple-output thanks for waiting! sorry, just got the details: it's rdar://95625035

sunshowers commented 2 years ago

I'm wondering, could you give it a shot again with cargo-nextest 0.9.43 and macOS Ventura? Nextest now uses posix_spawn rather than fork/exec on Mac, which could potentially have made this better. (If building locally, make sure it's with Rust 1.65)

quadruple-output commented 2 years ago

I don't think that I have the possibility to re-test this, I'm afraid. As stated above, the issue was triggered by a configuration profile that had been rolled-out by our central IT team. After our complaints, they reverted this profile, such that the issue does not occur, any more. After all, I don't see regard this as a bug in nextest. Maybe we can close this issue (@johamster ?).

sunshowers commented 2 years ago

Thanks for getting back! I'll also ask my Apple contact about the status.