status-im / status-desktop

Status Desktop client made in Nim & QML
https://status.app
Mozilla Public License 2.0

[Windows] Optimise the windows build process #14858

Open alexjba opened 3 months ago

alexjba commented 3 months ago

Description

A clean build on Windows takes more than 30 minutes (vs. 5 on macOS). Something could be wrong in the build configuration on Windows.

The purpose of this task is to analyse and optimise the Windows build process to reduce the build time.

Reproduced on an AMD Ryzen Z1 Extreme with an M.2 SSD and 16 GB of LPDDR5X.

Motivation

This optimisation would have a big impact on the Jenkins jobs running on every PR and would also help Windows devs.

Screenshot 2024-05-20 at 14 18 01
alexjba commented 3 months ago

Adding both ui-team and backend-team labels. I guess any team could take this

@noeliaSD @jrainville Feel free to update the tags if needed

alexjba commented 3 months ago

@Seitseman @alaibe You're the only Windows devs I know of. Do you also experience this slow build process?

alaibe commented 3 months ago

Adding @Cuteivist

jrainville commented 3 months ago

I know there are some ugly hacks on Windows where we need to clean everything before each run, because otherwise subsequent jobs would fail. @jakubgs probably knows more.

jakubgs commented 3 months ago

Indeed we do wipe the whole workspace after every build due to various issues: https://github.com/status-im/status-desktop/blob/146a6e85018bd7e8cf9a68735296a8ca2220faa8/ci/Jenkinsfile.windows#L121-L124

But checkout takes around 3 minutes only:

image

What takes about 8 minutes is make update:

07:04:53  + make update
07:08:18  Building: Nim compiler
07:12:13  + make deps
07:12:16  + make status-go

The build of the package itself takes about 12 minutes, which is about the same as on other platforms.

So the issue is in checking out all the submodules, not in the build.
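
The step durations can be read straight off the log timestamps. Below is a minimal Python sketch of that arithmetic, assuming every log line starts with an HH:MM:SS timestamp followed by two spaces, as in the excerpt above. The helper name `stage_durations` is illustrative and not part of the CI tooling.

```python
from datetime import datetime

# Log excerpt copied from the comment above.
LOG = """\
07:04:53  + make update
07:08:18  Building: Nim compiler
07:12:13  + make deps
07:12:16  + make status-go
"""

def stage_durations(log: str) -> list[tuple[str, int]]:
    """Return (label, seconds until the next log line) for every line but the last."""
    entries = []
    for line in log.splitlines():
        stamp, _, label = line.partition("  ")
        entries.append((datetime.strptime(stamp, "%H:%M:%S"), label.strip()))
    return [
        (label, int((next_ts - ts).total_seconds()))
        for (ts, label), (next_ts, _) in zip(entries, entries[1:])
    ]

for label, seconds in stage_durations(LOG):
    print(f"{label}: {seconds // 60}m{seconds % 60:02d}s")
```

The first two intervals (3m25s and 3m55s) add up to 7m20s, which is the span from `make update` to `make deps`.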

jakubgs commented 3 months ago

In general I see two possibilities of improving the performance of Windows builds:

  1. Reinstall Windows CI slaves by using ReFS which has better performance than NTFS.
  2. Attempt to upgrade Git to 2.45.2 from version 2.28.0, which we locked in 2021 due to an obscure build issue.

Both could improve Git checkout performance, while the first option could also improve build performance itself.

jakubgs commented 3 months ago

Actually, the locking broke down a while ago on some hosts:

 > a ci-slave-windows -m win_shell -o -a 'git version' 
windows-03.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.41.0.windows.1
windows-01.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.28.0.windows.1
windows-02.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.37.1.windows.1
windows-01.he-eu-hel1.ci.release | CHANGED | rc=0 | (stdout) git version 2.37.3.windows.1

I will attempt to upgrade on one host to see if it breaks anything as a first step.

So we might as well just drop that fix and upgrade all to a recent 2.45.2 version.
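
To see at a glance which hosts have drifted from the pinned version, the Ansible output can be parsed. This is a rough Python sketch, assuming the one-line-per-host output format shown above. The function names are made up for illustration, and the pinned and target versions are the ones mentioned in this thread.

```python
import re

# Output copied from the ad-hoc Ansible run above.
ANSIBLE_OUTPUT = """\
windows-03.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.41.0.windows.1
windows-01.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.28.0.windows.1
windows-02.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.37.1.windows.1
windows-01.he-eu-hel1.ci.release | CHANGED | rc=0 | (stdout) git version 2.37.3.windows.1
"""

LINE_RE = re.compile(
    r"^(\S+) \| \w+ \| rc=\d+ \| \(stdout\) git version (\d+)\.(\d+)\.(\d+)"
)

def git_versions(output: str) -> dict[str, tuple[int, ...]]:
    """Map each host name to its Git version as a (major, minor, patch) tuple."""
    versions = {}
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            versions[match.group(1)] = tuple(int(part) for part in match.groups()[1:])
    return versions

def hosts_not_at(versions: dict, pinned: tuple) -> list[str]:
    """Hosts whose Git version differs from the pinned one."""
    return sorted(host for host, version in versions.items() if version != pinned)

def hosts_below(versions: dict, target: tuple) -> list[str]:
    """Hosts that would need an upgrade to reach the target version."""
    return sorted(host for host, version in versions.items() if version < target)

versions = git_versions(ANSIBLE_OUTPUT)
print("Not at pinned 2.28.0:", hosts_not_at(versions, (2, 28, 0)))
print("Below target 2.45.2:", hosts_below(versions, (2, 45, 2)))
```

With the output above, three of the four hosts are no longer on 2.28.0, and all four are below 2.45.2.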

jakubgs commented 3 months ago

I have opened an issue with Hetzner to inquire about reinstallation with separate ReFS partition:

Hello,

I'd like to know if it would be possible to reinstall our Windows Server 2019 hosts with a different partition layout?

We have found that when we use them for CI builds the Git performance is very bad, and we were thinking employing ReFS would help with this:

- https://github.com/MicrosoftDocs/windowsserverdocs/blob/main/WindowsServerDocs/storage/refs/refs-overview.md
- https://geekflare.com/implement-refs-file-system-in-windows-server/
- https://learn.microsoft.com/en-us/windows/dev-drive/

Would it be possible to do a layout where ~50 GB are allocated for C:/ system partition on NTFS and the rest for a D:/ data partition on ReFS?
Would this be possible while maintaining the RAID1 layout?

Cheers

Ticket: #2024060603027071

jakubgs commented 3 months ago

I have run a filesystem benchmark on windows-03 to have a baseline for future comparison with the ReFS filesystem:

image

This is not a great comparison since this is a C: drive that is used by the system.

jakubgs commented 3 months ago

Hetzner support gave me actually useful information after some probing:

Windows as activated via robot will always use the whole disk space of DISK0 for the system partition (aka C:). In case your server has 2 hard disks of the same size a mirror will be set up on DISK1.

You can request to perform an individual and manually configured partition layout. In order to do this please don't activate an installation via robot and perform a support request instead asking for a custom windows installation to be activated for you - our support needs to get confirmation for the deletion of all data on the respective server, so please also add those details to your request.

As a reply you'll receive a booted windows installation and kvm access with which you can set up the partitioning - the remaining installation beside the partitioning phase will still be unattended and does not require any further interaction from your side. Our installations auto activate during deployment so you don't need to consider any licensing related topics there.

ISOs in general can be found at our mirror (only accessible from within the hetzner network):

http://mirror.hetzner.com/bootimages/windows/

Note that those isos are vanilla and do not include a license. Licensing a vanilla installation afterwards is not supported - so if you want to use a hetzner licensed windows the process needs to be as described above.

Last not least:

Since you're mentioning ReFS you may be already aware, that it can't be used on a system partition and needs to be set up afterwards. Therefore another option would be to either reinstall automatically with the default hetzner partition setup or to not reinstall at all but just to disable the mirror and repartition the second harddisk afterwards, using ReFS by creating a respective storage pool. You may also online resize the C:\ partition and use the freed space in the same way.

So we have a few options:

jakubgs commented 3 months ago

And indeed, if we go to Computer Management and then Disk Management we can see that the NTFS partition is mirrored:

image

And the context menu does have some options to disable mirroring:

image

Not sure how those two are different though.

jakubgs commented 3 months ago

Interestingly, the mirror is not enabled on windows-03, most probably due to some kind of misconfiguration during installation:

image

Which makes it a perfect candidate to test ReFS.

jakubgs commented 3 months ago

After removing the confusing D:, E:, and F: drives I can create a ReFS volume:

image

jakubgs commented 3 months ago

Side-by-side comparison of clone performance of status-desktop on the same device with NTFS and ReFS filesystems using:

time git clone --recurse-submodules https://github.com/status-im/status-desktop.git

| Filesystem | Recursive Clone Time |
|---|---|
| NTFS - System volume | 2m11.168s |
| NTFS - Dedicated volume | 1m58.713s |
| ReFS - 4K unit size | 1m56.189s |
| ReFS - 64K unit size | 1m57.062s |

In this test at least, there is no meaningful difference between NTFS and ReFS, but there is a difference of about 12 seconds (roughly 10%) between the system partition and a dedicated partition. A real test would involve Jenkins running the whole build on ReFS.
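
To put numbers on the comparison, the `time` outputs can be converted to seconds and compared against the system volume. This is a small Python sketch using only the four measurements listed above. The helper `to_seconds` is illustrative.

```python
import re

# Measurements copied from the comparison above.
CLONE_TIMES = {
    "NTFS - System volume": "2m11.168s",
    "NTFS - Dedicated volume": "1m58.713s",
    "ReFS - 4K unit size": "1m56.189s",
    "ReFS - 64K unit size": "1m57.062s",
}

def to_seconds(text: str) -> float:
    """Convert a `time`-style duration such as '2m11.168s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

baseline = to_seconds(CLONE_TIMES["NTFS - System volume"])
for name, raw in CLONE_TIMES.items():
    seconds = to_seconds(raw)
    saving = (baseline - seconds) / baseline * 100
    print(f"{name}: {seconds:.3f}s ({saving:.1f}% faster than the system volume)")
```

The three dedicated-volume results land within about 2.5 seconds of each other and are roughly 9.5% to 11.4% faster than the system volume. Since this is a single run per configuration, the gap between NTFS and ReFS is likely within noise.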

jakubgs commented 3 months ago

Apparently:

What is the difference between the ‘Break Mirror’ vs the ‘Remove Mirror’ option?

The “Break Mirror” operation, will stop the mirroring on the selected volume, without affecting the data on any disk. (Data will remain untouched on both disks).

The “Remove Mirror” operation, will stop the mirroring on the selected volume and destroys all the data on the mirror disk. (Data will remain only on one disk).

https://www.bulldogtechinc.com/2021/08/05/how-to-remove-or-break-hard-drive-mirror-on-windows-7-8-10-os/

jakubgs commented 3 months ago

I have run a status-desktop build from master on windows-03 after updating the node configuration: https://ci.status.im/job/status-desktop/job/systems/job/windows/job/x86_64/job/package/662/

And the results don't appear to show any difference for dedicated ReFS:

image

I will try with dedicated NTFS volume.

jakubgs commented 3 months ago

A dedicated NTFS volume makes little to no difference:

image

jakubgs commented 2 weeks ago

I think we should upgrade Windows slaves to match Linux ones, as currently they are smaller, AX41 instead of AX61:

image

image

This should speed up the builds at least a bit, in addition to splitting the mirror setup.

jakubgs commented 2 weeks ago

Actually, even just switching from AX41 to AX42 should make a big difference in terms of CPU power:

image

https://www.cpubenchmark.net/compare/3481vs6001/AMD-Ryzen-5-3600-vs-AMD-Ryzen-7-PRO-8700GE

jakubgs commented 2 weeks ago

We should get one AX42, bootstrap it, disable mirroring of the volumes, get ReFS on the volume, and test build performance.

jakubgs commented 2 weeks ago

In order to bootstrap a fresh Windows Server 2019 Standard we use the following script: https://github.com/status-im/infra-tf-google-cloud/blob/master/setup.ps1 Originally this script was used to bootstrap Google Cloud Windows hosts like this:

    /* Run PowerShell script for initial setup of a Window machine */
    sysprep-specialize-script-ps1 = (var.win_password == null ? null :
      templatefile("${path.module}/setup.ps1", {
        hostname = each.key
        domain   = var.domain
        password = var.win_password
        ssh_key  = var.ssh_keys[0]
      })
    )

https://github.com/status-im/infra-tf-google-cloud/blob/412c4802cd109a462a2fe862672283b9e8953c0e/main.tf#L139-L147

But for Hetzner the script will have to be modified to provide these four variables.

jakubgs commented 2 weeks ago

Once the script finishes you can finally connect over SSH using the administrator user and run the full bootstrap: https://github.com/status-im/infra-role-bootstrap-windows https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/bootstrap.yml#L53-L60

Once that is complete we can run roles specific to the CI slave configuration: https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/slaves.yml#L38-L45

jakubgs commented 2 weeks ago

In order to make the separate J: volume work for the Jenkins user you'll also need to modify the jenkins_home variable:

jenkins_home: '{{ jenkins_os_homes[ansible_system|lower] }}{{ jenkins_user }}'

https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/roles/jenkins-slave-user/defaults/main.yml#L8

This probably should be initially only done in host_vars since we have a heterogeneous setup.

jakubgs commented 2 weeks ago

Let's call the new host windows-04 for now, we can always rename it.

markoburcul commented 2 weeks ago

I've ordered and received a new AX42 host and connected to it with Remmina RDP. One important thing to note is that the Security protocol negotiation needs to be set to TLS protocol security in the Advanced tab of your RDP connection settings: Screenshot from 2024-08-28 14-06-36

After connecting to the host I ran the script, but before that there was an issue with the script not being signed, so I had to resolve it by changing the execution policy: Screenshot from 2024-08-28 14-06-04

Now I'll proceed with bootstrapping the host.

markoburcul commented 2 weeks ago

Managed to fully bootstrap the windows-04 node and the Jenkins agent. I had some issues with the bootstrap windows playbook and with the infra-ci bootstrap playbook that configures Windows slaves:

After successfully running these playbooks, I ran the build, but it failed because cmake was not installed.

markoburcul commented 2 weeks ago

Okay, so actually there were issues with the status-desktop-setup role because the script failed silently:

Please try again or create a new issue by using the following link and paste your console output: https://github.com/ScoopInstaller/Extras/issues/new?title=inno-setup%406.3.3%3a+decompress+error

And apparently there is an [issue and workaround](https://github.com/ScoopInstaller/Extras/issues/13911) for this. After trying it out the script installed packages smoothly. I will add these steps to the role.

Another thing is [this block](https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/roles/status-desktop-setup/tasks/win32nt.yml#L13C1-L17C30)  which also fails silently:

changed: [windows-04.he-eu-hel1.ci.devel] => { "changed": true, "cmd": "scoop hold git", "delta": "0:00:00.544819", "end": "2024-08-29 08:59:30.253400", "rc": 0, "start": "2024-08-29 08:59:29.708581" }

STDOUT:

ERROR 'git' is not installed.

But if we do `scoop list` on the host, we will see that git is actually installed:

$ scoop list
Installed apps:

| Name | Version | Source | Updated | Info |
|---|---|---|---|---|
| innounp-unicode | 1.72 | versions | 2024-08-29 10:54:07 | |
| ntop | 0.3.4 | main | 2024-08-29 10:09:17 | |
| 7zip | 24.08 | main | 2024-08-28 13:49:13 | Global install |
| ag | 2.2.5 | C:\Users\Administrator\scoop\buckets\main\bucket\ag.json | 2024-08-28 17:16:44 | Global install |
| cacert | 2024-07-02 | main | 2024-08-29 09:02:59 | Global install |
| cmake | 3.30.2 | main | 2024-08-29 10:05:11 | Global install |
| cmder | 1.3.25 | main | 2024-08-28 17:16:10 | Global install |
| dark | 3.14 | main | 2024-08-28 13:48:53 | Global install |
| dd | 0.6beta3 | main | 2024-08-28 17:16:11 | Global install |
| diffutils | 3.6 | C:\Users\Administrator\scoop\buckets\main\bucket\diffutils.json | 2024-08-28 17:16:45 | Global install |
| dos2unix | 7.5.0 | C:\Users\Administrator\scoop\workspace\dos2unix.json | 2024-08-28 17:16:45 | Global install |
| findutils | 4.4.2 | C:\Users\Administrator\scoop\buckets\main\bucket\findutils.json | 2024-08-28 17:16:45 | Global install |
| firefox | 129.0.2 | extras | 2024-08-28 17:16:18 | Global install |
| gcc | 13.2.0 | main | 2024-08-29 10:58:40 | Global install |
| git | 2.46.0 | main | 2024-08-28 13:49:24 | Global install |

...


So, since git is a global install, we should actually be using `scoop hold --global git`.
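
Because the command printed an error while still returning rc=0, checking the exit code alone misses this failure. One possible guard, sketched in Python, is to also scan the output for error lines. It assumes scoop prefixes such messages with `ERROR`, as in the output above. The function name `command_failed` is hypothetical and not part of the role.

```python
def command_failed(rc: int, stdout: str) -> bool:
    """Treat a command as failed if it exits non-zero OR prints an ERROR line.

    Assumption: error messages start with 'ERROR', as in the
    `scoop hold git` output quoted above.
    """
    if rc != 0:
        return True
    return any(line.strip().startswith("ERROR") for line in stdout.splitlines())

# The silent failure from the task output above: rc is 0, yet the hold did not happen.
print(command_failed(0, "ERROR 'git' is not installed."))
```

In an Ansible task the same idea would usually be expressed as a `failed_when` condition on the registered result.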
markoburcul commented 2 weeks ago

After finally being able to run the build, the results are not as expected. The build time is approximately the same: average build is 24min on AX41 and I got 22min on AX42. Screenshot from 2024-08-29 12-40-19

I was also inspecting the load on the host, but nothing much was happening. The resources were underutilized the whole time. This screenshot is from the Package stage of the build. Screenshot from 2024-08-29 10-21-25

I don't think vertical scaling of the host will bring that much of a speedup.

markoburcul commented 2 weeks ago

Also, the build that failed with this error:

11:33:32  Warning: Cannot find Visual Studio redist directory, C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\redist.
...
11:33:32  mv: cannot stat 'tmp/windows/dist/Status/bin/vc_redist.x64.exe': No such file or directory
11:33:32  make: *** [Makefile:765: pkg/StatusIm-Desktop-240829-091632-69cdd3-nightly-x86_64.exe] Error 1

happened because it was looking in the wrong directory for the redist. The problem occurred because we set the VCINSTALLDIR environment variable here, and this is executed after we start the Jenkins agent executable, so the agent doesn't have the required environment variable. If we look at the order of the roles executed to configure Windows slave agents, we can see that windows-jenkins-agent is executed before status-desktop-setup here. After restarting the Jenkins agent on the host, the issue is gone.
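
The underlying behaviour is that a process keeps the environment it was started with, so a variable defined later is invisible to it and its children until it is restarted. Here is a small Python sketch of that effect. The two dictionaries model the agent's environment before and after the variable is defined, and the VCINSTALLDIR value is a made-up placeholder, not the real path.

```python
import os
import subprocess
import sys

def read_var_in_child(name: str, env: dict) -> str:
    """Start a child process with the given environment and report what it sees."""
    code = "import os, sys; print(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Environment captured when the "agent" starts, before the variable exists.
agent_env = {k: v for k, v in os.environ.items() if k != "VCINSTALLDIR"}

# Environment a freshly restarted agent would get once the variable is defined.
# The path is a placeholder for illustration only.
restarted_env = dict(agent_env, VCINSTALLDIR=r"C:\example\VC")

print(read_var_in_child("VCINSTALLDIR", agent_env))      # agent started too early
print(read_var_in_child("VCINSTALLDIR", restarted_env))  # agent restarted afterwards
```

Running status-desktop-setup before windows-jenkins-agent, or restarting the agent after the setup role, would both give the agent the second environment.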

markoburcul commented 1 week ago

I've tested the status-desktop build by modifying the parameters here and adding the --jobs flag with a value equal to the number of vCPUs of the machine. The running time improves by approximately 1 minute:

Maybe it would be worth adding the same flag to the nimbus build system.

P.S. I've discovered that the utils.getProcCount() function from the Jenkins lib returns the wrong number of CPUs. We use it to set the MAKEFLAGS env variable, where the value is passed to the -j flag, so we should fix this.

jakubgs commented 1 week ago

Yes, we should fix utils.getProcCount() to maybe call something like nproc.

As for the --jobs flag for Git, definitely a good idea. But I would also like to see if we can improve Windows performance with Nim compiler flags like --threads.
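
As a rough illustration of deriving both values from one CPU count, here is a Python sketch. It is not the Jenkins library code (which is Groovy): `os.cpu_count()` stands in for something like `nproc`, the function names are hypothetical, and the exact Git command the build runs is not shown in this thread, so `git submodule update` with its `--jobs` option is an assumption.

```python
import os

def proc_count() -> int:
    """Number of logical CPUs, falling back to 1 if it cannot be determined."""
    return os.cpu_count() or 1

def make_flags(jobs: int) -> str:
    """Value for the MAKEFLAGS environment variable."""
    return f"-j{jobs}"

def submodule_update_cmd(jobs: int) -> list[str]:
    """Git command that fetches submodules with `jobs` parallel workers.

    Assumption: the build uses `git submodule update`; the thread only
    mentions that a --jobs flag was added.
    """
    return ["git", "submodule", "update", "--init", "--recursive", "--jobs", str(jobs)]

jobs = proc_count()
print("MAKEFLAGS=" + make_flags(jobs))
print(" ".join(submodule_update_cmd(jobs)))
```

Note that `os.cpu_count()` reports all logical CPUs in the system, which can differ from what `nproc` prints when the process is restricted to a subset of CPUs.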

markoburcul commented 1 week ago

Here are the results with different parameter values for --jobs submodules flag, -j make flag and nim compiler --threads flag:

Now the deps and package steps seem to take longer than yesterday. I was looking at some more optimization flags but couldn't find anything that hasn't already been set in the Makefile.

One thing that I don't understand is the usage of nimcache. It is stored within the repo that is cloned at the start of the build, and when the build finishes it is wiped out along with the whole repo. Is it supposed to be used as a cache between different build stages? Would it be an option to define the cache somewhere else where it would persist and could be reused across builds?

jakubgs commented 1 week ago

We have had bad behavior during builds as a result of Nim cache, hence it is not preserved between builds. In general it is safest to not use cache in between builds to make sure that every build is separate and is not affected by things that are not in the repo.

Now, if use of Nim cache can make a BIG difference on Windows we could re-consider, if it's big enough that it might be worth taking on the risks associated with using even per-PR Nim cache.

markoburcul commented 5 days ago

The Nimbus team suggested disabling all filters on the drive that the Jenkins agent uses (J:). The commands from the article are not available, but the article explains how they translate into registry entries, so I added these entries:

1. devdrv enable -> FsEnableDevDrive=1 in CCS\Control\FileSystem
2. disallowAv -> FltmgrDevDriveAllowAntivirusFilter=0 in CCS\Control\FilterManager

I ran a CI build for Windows, but there was no speed-up in the build time.
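
For reference, the two registry entries above could be written with `reg add` commands along these lines. This Python sketch only builds and prints the command lines and does not touch the registry. It assumes that CCS stands for `HKLM\SYSTEM\CurrentControlSet` and that both values are DWORDs, neither of which is stated explicitly in this thread.

```python
# Assumption: "CCS" in the list above abbreviates this registry path.
BASE = r"HKLM\SYSTEM\CurrentControlSet\Control"

# (subkey, value name, value) taken from the two entries listed above.
ENTRIES = [
    ("FileSystem", "FsEnableDevDrive", 1),
    ("FilterManager", "FltmgrDevDriveAllowAntivirusFilter", 0),
]

def reg_add_command(subkey: str, name: str, value: int) -> str:
    """Build a `reg add` command line; REG_DWORD is an assumed value type."""
    return f'reg add "{BASE}\\{subkey}" /v {name} /t REG_DWORD /d {value} /f'

for entry in ENTRIES:
    print(reg_add_command(*entry))
```

Printing the commands first makes it easy to review them before running anything on a CI host.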